Gandalf the Grader

Gandalf the Grader is an Agent-as-a-Judge verifier for reinforcement learning environments. Gandalf produces an evaluation score / reward signal by grading the final state of the environment (including the agent's text response, the state of the filesystem, and the state backing MCP tools) against a set of natural-language criteria specified in a rubric.
Unlike LLM-as-a-Judge or simple workflows (e.g., serialize-then-judge), Gandalf is capable of verifying outputs that are complex files, such as Excel or PowerPoint deliverables, and checking them against sophisticated rubric criteria, like "sensitivity table is properly constructed using Excel Data Table functionality or equivalent formula array." Gandalf is able to do this because it grades rubric criteria by running an AI agent within the RL environment itself.
See BankerToolBench as a complete example of Gandalf being used as the verifier in a production RL environment. See rle-pkg as a reference RL environment runtime that includes integration with Gandalf.
Gandalf is agnostic to RL environment framework/runtime; we have been using it mainly with Harbor.
Installation
Gandalf is published on PyPI.
```shell
uv tool install gandalf-the-grader
```
For production use, we recommend pinning a specific version of Gandalf and, furthermore, using the [pinned] extra to pin all transitive dependencies as well.
```shell
uv tool install 'gandalf-the-grader[pinned]==1.0.0'
```
Quick start
Create a grader config (grader.toml):
```toml
model = "gemini/gemini-2.5-flash"
sandbox_user = "sandbox"
instructions = "Build a web app that displays hello world."
rubric_path = "/tests/rubric.json"
workdir = "/home/agent/workspace"
trajectory_path = "/logs/agent/trajectory.json"
output_dir = "/logs/grader"
```
Create a rubric (rubric.json):
```json
[
  {"criterion": "The file index.html exists in the workspace", "weight": 1.0},
  {"criterion": "The page displays 'Hello World'", "weight": 2.0}
]
```
Run the grader:
```shell
gandalf-the-grader --config /tests/grader.toml
```
Configuration
grader.toml
| Field | Required | Default | Description |
|---|---|---|---|
| instructions | Yes* | | Inline task instructions given to the original agent (mutually exclusive with instructions_path) |
| instructions_path | Yes* | | Path to a file with task instructions (mutually exclusive with instructions) |
| rubric | Yes* | | Inline rubric as a TOML array of tables (mutually exclusive with rubric_path) |
| rubric_path | Yes* | | Path to rubric JSON file (mutually exclusive with rubric) |
| judge_guidance | No | | Inline judge guidance text (mutually exclusive with judge_guidance_path) |
| judge_guidance_path | No | | Path to a file with extra judge instructions (mutually exclusive with judge_guidance) |
| workdir | Yes | | Agent workspace directory |
| trajectory_path | Yes | | Path to ATIF trajectory JSON |
| output_dir | Yes | | Directory for grader output files |
| model | No | gemini/gemini-2.5-flash | LLM model for the judge agent |
| mode | No | batch | Evaluation mode: batch or individual |
| judge_timeout | No | 300 | Max seconds per judge invocation |
| batch_timeout | No | | Max total seconds for batch mode (caps judge_timeout * N) |
| judge_retries | No | 1 | Number of retry attempts for criteria that error due to infrastructure failures |
| batch_splits | No | | Split criteria into N chunks in batch mode (>= 2); each chunk is evaluated as a separate batch session. Only valid with mode = "batch" |
| max_concurrency | No | | Max parallel judge sessions (>= 1). Defaults to 1 in individual mode, batch_splits in batch mode |
| sandbox_user | No | | Username for running the inner judge (via sudo). When omitted, the judge runs as the current user |
| judge_prompt | No | | Inline Jinja2 template that completely overrides the built-in judge task prompt (mutually exclusive with judge_prompt_path) |
| judge_prompt_path | No | | Path to a Jinja2 template file that completely overrides the built-in judge task prompt (mutually exclusive with judge_prompt) |

\* One of each mutually exclusive pair is required.
MCP servers can be configured as TOML array of tables:
```toml
[[mcp_servers]]
name = "magic-server"
transport = "stdio"
command = "/usr/bin/mcp-server"
args = ["--verbose"]
```
Custom Judge Prompt
By default, the grader uses a built-in prompt template to kick off each judge session. judge_prompt / judge_prompt_path let you replace it entirely with a custom Jinja2 template.
Note: This prompt is sent as the opening user message to the judge agent, not the LLM system prompt. The underlying agent framework (OpenHands) has its own immutable system message with coding and tool-use instructions that we never modify. Our prompt sits on top of that as the first user turn, setting up the grading task.
For most use cases, judge_guidance / judge_guidance_path is all you need: it injects extra instructions into the built-in prompt without replacing it. Fully overriding the judge prompt is an uncommon escape hatch for situations where the built-in prompt structure itself is unsuitable.
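As an illustration, inline guidance in grader.toml could look like the following (the guidance text itself is hypothetical; any domain-specific instructions work here):

```toml
# Hypothetical example: extra instructions injected into the built-in judge prompt
judge_guidance = """
When a criterion concerns a spreadsheet deliverable, open the file and
inspect its formulas rather than relying on the agent's final message.
"""
```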
The template receives these variables:
| Variable | Type | Mode | Description |
|---|---|---|---|
| instructions | str | both | Task instructions given to the original agent |
| final_output | str | both | Agent's final message from the trajectory |
| criterion | str | individual | The single criterion string to evaluate |
| criteria | list[str] | batch | List of all criterion strings to evaluate |
| verdict_path | str | both | File path the judge must write its verdict to |
| judge_guidance | str | both | Additional guidance text (may be empty) |
Individual and batch modes use separate built-in templates. In a custom template, use {% if criterion is defined %} vs {% if criteria is defined %} if you need to distinguish modes. In batch mode, use loop.index0 for the criterion index (e.g., {% for c in criteria %}[{{ loop.index0 }}] {{ c }}{% endfor %}).
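Putting those pieces together, a minimal custom template that handles both modes might look like this (the variable names come from the table above; the prompt wording is illustrative, not the built-in template):

```jinja
Task given to the original agent:
{{ instructions }}

Agent's final message:
{{ final_output }}

{% if criteria is defined %}
Evaluate each numbered criterion against the workspace:
{% for c in criteria %}[{{ loop.index0 }}] {{ c }}
{% endfor %}
{% else %}
Evaluate this criterion against the workspace: {{ criterion }}
{% endif %}
{{ judge_guidance }}
Write your verdict as JSON to {{ verdict_path }}.
```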
Rubric JSON
A JSON array of objects with criterion (string) and weight (float). Weights can be negative to penalise undesired outcomes:
```json
[
  {"criterion": "The output file exists", "weight": 2.0},
  {"criterion": "The output contains correct totals", "weight": 3.0},
  {"criterion": "The agent used hardcoded values instead of computing", "weight": -1.0}
]
```
- Positive weight: adds to the raw score when the criterion's condition is met
- Negative weight: deducts from the raw score when the criterion's condition is met (the bad thing happened)
- The judge evaluates each criterion on its own merits; it never sees weights
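The scoring rule these bullets describe (stated formally in the Output section: reward = clip(0, 1, raw_score / sum_of_positive_weights)) can be sketched in a few lines of Python. compute_reward and its (raw, reward) return shape are illustrative names, not Gandalf's API:

```python
def compute_reward(results: list[tuple[bool, float]]) -> tuple[float, float]:
    """Illustrative sketch of the scoring rule, not Gandalf's actual code.

    Each item in `results` is a (met, weight) pair. Met criteria
    contribute their weight (positive or negative) to the raw score;
    the reward is the raw score divided by the sum of positive
    weights, clipped to [0, 1].
    """
    raw_score = sum(weight for met, weight in results if met)
    positive_total = sum(weight for _, weight in results if weight > 0)
    reward = max(0.0, min(1.0, raw_score / positive_total))
    return raw_score, reward

# Example rubric above, with every criterion met (including the penalty):
raw, reward = compute_reward([(True, 2.0), (True, 3.0), (True, -1.0)])
# raw = 2.0 + 3.0 - 1.0 = 4.0; reward = 4.0 / (2.0 + 3.0) = 0.8
```

Note that because the reward is clipped at 0, heavy penalties can drive the raw score negative without producing a negative reward.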
Trajectory Format (ATIF)
The grader reads agent trajectories in Agent Trajectory Interchange Format (ATIF). An ATIF file is a JSON object with a steps array:
```json
{
  "steps": [
    {"source": "user", "message": "Build a hello world web app"},
    {"source": "agent", "message": "I'll create the file now", "tool_calls": [...]},
    {"source": "agent", "message": "Done! I created index.html with a Hello World page."}
  ]
}
```
The grader extracts the final agent message (last "source": "agent" step with a non-empty message and no tool_calls) and passes it to the judge as context.
Environment Variables
| Variable | Description |
|---|---|
| LLM_API_KEY | API key for the LLM provider |
| LLM_BASE_URL | Base URL for the LLM API (optional) |
| GRADER_INSTRUCTIONS_PATH | Fallback path to the task instructions file (if not set in TOML) |
| GRADER_JUDGE_GUIDANCE_PATH | Fallback path to the judge guidance file (if not set in TOML) |
| GRADER_JUDGE_PROMPT_PATH | Fallback path to a custom judge prompt template (if not set in TOML) |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP endpoint URL for trace export (optional) |
| OTEL_EXPORTER_OTLP_HEADERS | OTLP auth headers, URL-encoded (optional) |
| OTEL_EXPORTER_OTLP_TRACES_PROTOCOL | OTLP transport protocol, e.g. http/protobuf (optional) |
Tracing / Observability
Gandalf builds on top of OpenHands, which has built-in OpenTelemetry tracing that automatically instruments LLM calls, tool executions, and agent steps. Set the OTEL_EXPORTER_OTLP_* variables above to export traces to any OTEL-compatible backend with no code changes required.
Example: Langfuse
```shell
# Encode your Langfuse keys
echo -n "pk-lf-...:sk-lf-..." | base64
# Export the variables
export OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel/v1/traces
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic%20<base64-encoded-keys>"
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
```
Output
The grader writes to output_dir:
- reward.json: Reward file, e.g. {"reward": 0.75} (always in [0, 1]). Only written when all criteria are successfully evaluated. If any criteria still have errors after retries, the grader writes info.json but skips reward.json and exits with code 1.
- info.json: Always written. Per-criterion results with met/not-met, reasoning, evidence, and LLM usage, plus reward, raw_score, minimum_score, maximum_score, errored_criterion_count, and evaluated_criteria_pct.
- judge_trace_*.txt: stdout/stderr capture for each judge invocation. Naming varies by mode: judge_trace_{i}.txt (individual), judge_trace_batch.txt (batch), judge_trace_batch_split{i}.txt (batch with splits). Retries append a _retry{N} suffix.
The reward in reward.json is clip(0, 1, raw_score / sum_of_positive_weights), always in [0, 1]. info.json additionally includes raw_score (the raw sum of weights for met criteria, which can be negative) and minimum_score/maximum_score bounds for reference.
License
Copyright (c) Handshake. Released under the Apache-2.0 license. See LICENSE.txt for details.