LLM-judge evaluators (G-Eval) for AgentForge

agentforge-eval-geval

LLM-judge evaluators for AgentForge (feat-006).

This package ships the G-Eval engine, an LLM-as-judge grader that scores agent outputs against a YAML rubric, plus six named graders that wrap G-Eval with reference rubrics:

Grader	Scores
`Correctness`	Output matches the ground-truth answer (rubric-tunable for binary / Likert / ordinal)
`Faithfulness`	Output is supported by the retrieved evidence (no claims beyond what was retrieved)
`Groundedness`	Output stays inside the provided sources (no off-source content)
`Hallucination`	Output contains content not derivable from inputs (faithfulness + groundedness combined into a single risk score)
`Relevance`	Output addresses the user's question rather than going off-topic
`Helpfulness`	Output is useful — actionable, complete, well-structured

Each grader takes a judge LLMClient at construction. The judge is typically a cheaper model than the agent's primary model (e.g. Haiku judging Sonnet output). Using a separate, cheaper judge cuts cost and reduces "judge agrees with itself" bias.
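To make the LLM-as-judge idea concrete, here is a minimal sketch of the general pattern: build a prompt from a rubric, send it to a judge, and parse a numeric score from the reply. It does not reproduce the package's actual prompt, parsing, or `LLMClient` interface. `build_judge_prompt`, `parse_score`, `grade`, and `fake_judge` are hypothetical names, and the judge is modelled as a plain callable so the example runs without any model access.

```python
import re
from typing import Callable


def build_judge_prompt(rubric: dict, output: str) -> str:
    """Assemble a grading prompt from a rubric dict (hypothetical layout)."""
    return (
        f"Criteria: {rubric['criteria']}\n"
        f"Scoring: {rubric['scoring']}\n"
        f"Output to grade:\n{output}\n"
        "Reply with a line of the form 'SCORE: <number between 0 and 1>'."
    )


def parse_score(reply: str) -> float:
    """Extract the score from the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"SCORE:\s*([0-9]*\.?[0-9]+)", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group(1))))


def grade(judge: Callable[[str], str], rubric: dict, output: str) -> float:
    """Run one judge call and return the parsed score."""
    return parse_score(judge(build_judge_prompt(rubric, output)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model (assumption: any str -> str callable)."""
    return "The summary covers the main change.\nSCORE: 0.8"


rubric = {
    "criteria": "Score the PR description's accuracy and completeness.",
    "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
}
score = grade(fake_judge, rubric, "Adds retry logic to the HTTP client.")
print(score)
```

Swapping `fake_judge` for a function that calls a real model leaves the rest of the flow unchanged, which is why the judge is kept behind a single callable here.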

Quick start

import asyncio

from agentforge import Agent
from agentforge_bedrock import BedrockClient
from agentforge_eval_geval import Correctness, Faithfulness

# Judge: a cheaper model than the agent's primary model.
judge = BedrockClient(model_id="us.anthropic.claude-haiku-4-5")

agent = Agent(
    model="bedrock:us.anthropic.claude-sonnet-4-5-20250929",
    evaluators=[
        Correctness(judge=judge, ground_truth_field="expected"),
        Faithfulness(judge=judge, sources_field="retrieved_docs"),
    ],
)


async def main() -> None:
    # agent.run is awaited, so it has to run inside an event loop.
    result = await agent.run("Summarise PR #42")
    print(result.eval_scores)


asyncio.run(main())

Cost-bounded: each judge call bills against the run's BudgetPolicy (feat-007). When the remaining budget falls below the grader's cost_estimate_usd, the agent skips the grader and logs a WARN.
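The skip rule above can be sketched as a simple gate. This is an illustration of the described behaviour, not the package's code: `GraderStub` and `select_affordable` are hypothetical names, and the sketch assumes each grader that runs consumes exactly its `cost_estimate_usd`, whereas the real `BudgetPolicy` presumably bills the actual cost of each judge call.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("budget_sketch")


@dataclass
class GraderStub:
    """Hypothetical stand-in for a grader; only the fields the gate needs."""
    name: str
    cost_estimate_usd: float


def select_affordable(graders, remaining_usd):
    """Return (graders that fit in the budget, budget left afterwards).

    A grader is skipped with a warning when the remaining budget is
    below its cost estimate; later, cheaper graders may still run.
    """
    runnable = []
    for grader in graders:
        if remaining_usd < grader.cost_estimate_usd:
            logger.warning(
                "skipping %s: needs $%.4f, only $%.4f left",
                grader.name, grader.cost_estimate_usd, remaining_usd,
            )
            continue
        runnable.append(grader)
        # Assumption: reserve the estimate up front instead of billing actuals.
        remaining_usd -= grader.cost_estimate_usd
    return runnable, remaining_usd


graders = [
    GraderStub("Correctness", 0.02),
    GraderStub("Faithfulness", 0.05),
    GraderStub("Relevance", 0.01),
]
runnable, left = select_affordable(graders, remaining_usd=0.04)
print([g.name for g in runnable])
```

With these made-up estimates and a $0.04 budget, `Faithfulness` no longer fits after `Correctness` has run, so it is skipped while the cheaper `Relevance` grader still runs.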

Custom rubrics with G-Eval

from agentforge_eval_geval import GEval

# `judge` is the BedrockClient created in the quick start.
grader = GEval(
    judge=judge,
    name="code-review-quality",
    rubric={
        "criteria": "Score the PR description's accuracy and completeness.",
        "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
        "examples": [
            {"output": "...", "score": 0.9, "reasoning": "..."},
        ],
    },
)

Or load from a YAML file:

grader = GEval.from_rubric_file("./rubrics/code-review.yaml", judge=judge)
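Whether a rubric is written inline or loaded from YAML, it can help to check its shape before handing it to `GEval`. The sketch below is a hypothetical helper, not part of the package. It assumes the three keys from the inline example (`criteria`, `scoring`, `examples`) and assumes example scores lie in the 0.0 to 1.0 range used by that example's scoring text. The package's real validation rules may differ.

```python
def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems found in a rubric dict (empty if none).

    Assumed schema, mirroring the inline example: non-empty string
    'criteria' and 'scoring', plus an optional list of 'examples'
    whose 'score' is a number in [0.0, 1.0].
    """
    problems = []

    for key in ("criteria", "scoring"):
        value = rubric.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")

    examples = rubric.get("examples", [])
    if not isinstance(examples, list):
        problems.append("'examples' must be a list")
        examples = []

    for i, example in enumerate(examples):
        score = example.get("score") if isinstance(example, dict) else None
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            problems.append(f"examples[{i}].score must be a number in [0.0, 1.0]")

    return problems


good = {
    "criteria": "Score the PR description's accuracy and completeness.",
    "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
    "examples": [{"output": "...", "score": 0.9, "reasoning": "..."}],
}
bad = {"criteria": "", "scoring": "x", "examples": [{"score": 1.5}]}

print(validate_rubric(good))
print(validate_rubric(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-written rubric file in a single pass.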

Release history

0.2.4 (this version), Jun 3, 2026

0.2.3, May 23, 2026

Download files

Download the file for your platform.

Source Distribution

agentforge_eval_geval-0.2.4.tar.gz (14.6 kB)

Uploaded Jun 3, 2026 Source

Built Distribution

agentforge_eval_geval-0.2.4-py3-none-any.whl (15.8 kB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file agentforge_eval_geval-0.2.4.tar.gz.

File metadata

Download URL: agentforge_eval_geval-0.2.4.tar.gz
Upload date: Jun 3, 2026
Size: 14.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentforge_eval_geval-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`1e7f44de7dba4150bfab520fdc816ded9e4566fd819af032c543a9529d6b7fdb`
MD5	`cf7b8421d7a4493f38aea5774aece98b`
BLAKE2b-256	`eb277aace0a072206d28e171f00e20a560b67945337c20f17ac728b3dd7ae07c`
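A downloaded file can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: `sha256_of` and `matches_published_hash` are illustrative helper names, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for agentforge_eval_geval-0.2.4.tar.gz (from the table above).
EXPECTED_SHA256 = "1e7f44de7dba4150bfab520fdc816ded9e4566fd819af032c543a9529d6b7fdb"


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Assumed download location; adjust to wherever the archive was saved.
archive = Path("agentforge_eval_geval-0.2.4.tar.gz")
if archive.exists():
    print(matches_published_hash(archive, EXPECTED_SHA256))
```

pip can perform the same check at install time through its hash-checking mode (`--require-hashes` with `--hash=sha256:...` entries in a requirements file).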

File details

Details for the file agentforge_eval_geval-0.2.4-py3-none-any.whl.

File metadata

Download URL: agentforge_eval_geval-0.2.4-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 15.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentforge_eval_geval-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2849ad814e506593ce1dda352ccade99f816a5990231eb5ed934a3dbf4b098a7`
MD5	`111e109a001e8d4bcaf45224f158d2c1`
BLAKE2b-256	`45d6959020feb3982877f92c520cf62df7e193cab20179700befdedfd67cf5a7`
