LLM-judge evaluators (G-Eval) for AgentForge

agentforge-eval-geval

LLM-judge evaluators for AgentForge (feat-006).

This package ships the G-Eval engine, an LLM-as-judge grader that scores agent outputs against a YAML rubric, plus six named graders that wrap G-Eval with reference rubrics:

Grader          Scores
--------------  ------------------------------------------------------------
Correctness     Output matches the ground-truth answer (rubric-tunable for binary / Likert / ordinal)
Faithfulness    Output is supported by the retrieved evidence (no claims beyond what was retrieved)
Groundedness    Output stays inside the provided sources (no off-source content)
Hallucination   Output contains content not derivable from inputs (faithfulness + groundedness combined into a single risk score)
Relevance       Output addresses the user's question vs going off-topic
Helpfulness     Output is useful: actionable, complete, well-structured

Each grader takes a judge LLMClient at construction. The judge is typically a cheaper model than the agent's primary model (e.g. Haiku judging Sonnet output), which cuts cost and reduces "judge agrees with itself" bias.
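To make the judge pattern concrete, here is a minimal sketch of one rubric-driven judge call. It is not the package's implementation: JudgeClient, its complete() method, the prompt wording, and the "SCORE:" reply format are all assumptions made for illustration, and FakeJudge stands in for a real model so the flow runs offline.

```python
import re
from typing import Protocol


class JudgeClient(Protocol):
    """Hypothetical stand-in for the judge LLMClient; the real interface may differ."""

    def complete(self, prompt: str) -> str: ...


def build_judge_prompt(criteria: str, scoring: str, output: str) -> str:
    """Assemble a rubric-driven grading prompt (illustrative wording only)."""
    return (
        f"Criteria: {criteria}\n"
        f"Scoring: {scoring}\n"
        f"Output to grade:\n{output}\n"
        "Reply with 'SCORE: <number between 0 and 1>'."
    )


def parse_score(reply: str) -> float:
    """Extract the score from the judge's reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"SCORE:\s*(-?\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group(1))))


def grade(judge: JudgeClient, criteria: str, scoring: str, output: str) -> float:
    """Run one judge call and return the parsed score."""
    prompt = build_judge_prompt(criteria, scoring, output)
    return parse_score(judge.complete(prompt))


class FakeJudge:
    """Canned judge used only to exercise the flow without a model call."""

    def complete(self, prompt: str) -> str:
        return "The summary is mostly accurate. SCORE: 0.8"


score = grade(
    FakeJudge(),
    criteria="Output matches the ground-truth answer.",
    scoring="0.0 = wrong; 1.0 = correct",
    output="The PR adds retry logic.",
)
print(score)  # 0.8
```

Clamping the parsed value keeps a misbehaving judge from pushing scores outside the rubric's range, and raising on a missing score makes malformed replies visible rather than silently scoring them as zero.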

Quick start

import asyncio

from agentforge import Agent
from agentforge_bedrock import BedrockClient
from agentforge_eval_geval import Correctness, Faithfulness

# Cheaper judge model, used only for grading.
judge = BedrockClient(model_id="us.anthropic.claude-haiku-4-5")

agent = Agent(
    model="bedrock:us.anthropic.claude-sonnet-4-5-20250929",
    evaluators=[
        Correctness(judge=judge, ground_truth_field="expected"),
        Faithfulness(judge=judge, sources_field="retrieved_docs"),
    ],
)

async def main() -> None:
    # agent.run is awaited, so it has to be called from inside a coroutine.
    result = await agent.run("Summarise PR #42")
    print(result.eval_scores)

asyncio.run(main())

Cost-bounded: each judge call bills against the run's BudgetPolicy (feat-007). When the remaining budget falls below the grader's cost_estimate_usd, the agent skips the grader and logs a WARN.
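The skip rule can be sketched in a few lines. This is an illustration under stated assumptions, not the BudgetPolicy API: the Grader dataclass and select_graders function are hypothetical, only the cost_estimate_usd name comes from the text, and the sketch charges each grader its estimate up front, whereas real billing would presumably use the actual cost of the judge call.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("budget_sketch")


@dataclass
class Grader:
    """Hypothetical grader record; only cost_estimate_usd is named in the text."""

    name: str
    cost_estimate_usd: float


def select_graders(graders: list[Grader], remaining_usd: float) -> list[Grader]:
    """Return the graders that fit the remaining budget, in their given order.

    Assumption: every grader that runs is charged its estimate immediately.
    """
    runnable = []
    for grader in graders:
        if remaining_usd < grader.cost_estimate_usd:
            # Mirrors the documented behaviour: skip the grader and log a warning.
            logger.warning(
                "skipping %s: $%.4f left, needs $%.4f",
                grader.name,
                remaining_usd,
                grader.cost_estimate_usd,
            )
            continue
        runnable.append(grader)
        remaining_usd -= grader.cost_estimate_usd
    return runnable


graders = [
    Grader("correctness", 0.002),
    Grader("faithfulness", 0.004),
    Grader("relevance", 0.001),
]
selected = select_graders(graders, remaining_usd=0.004)
print([g.name for g in selected])  # ['correctness', 'relevance']
```

A skipped grader does not stop later, cheaper graders from running, which is why relevance still runs here after faithfulness is skipped.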

Custom rubrics with G-Eval

from agentforge_eval_geval import GEval

grader = GEval(
    judge=judge,
    name="code-review-quality",
    rubric={
        "criteria": "Score the PR description's accuracy and completeness.",
        "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
        "examples": [
            {"output": "...", "score": 0.9, "reasoning": "..."},
        ],
    },
)
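Before handing a rubric to a grader, it can help to check its shape. The checker below is a hypothetical helper, not part of the package: validate_rubric, the choice to treat "examples" as optional, and the [0.0, 1.0] score range (inferred from the scoring line in the rubric above) are all assumptions.

```python
from typing import Any

# Assumption: "criteria" and "scoring" are required, "examples" is optional.
REQUIRED_KEYS = ("criteria", "scoring")


def validate_rubric(rubric: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a rubric dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = rubric.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    for i, example in enumerate(rubric.get("examples", [])):
        score = example.get("score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            problems.append(f"examples[{i}].score must be a number in [0.0, 1.0]")
    return problems


good = {
    "criteria": "Score the PR description's accuracy and completeness.",
    "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
    "examples": [{"output": "...", "score": 0.9, "reasoning": "..."}],
}
bad = {"criteria": "", "scoring": "0 to 1", "examples": [{"score": 1.5}]}

print(validate_rubric(good))  # []
print(validate_rubric(bad))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a rubric at once.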

Or load from a YAML file:

grader = GEval.from_rubric_file("./rubrics/code-review.yaml", judge=judge)
