Defensive verifier framework and helpers for Harbor evaluations

Project description

Graded

Graded is a library that makes computing rewards for agent evaluations simple, defensive, and structured, particularly within Harbor environments. It provides tools to declare structured grading criteria, run LLM judges with automatic tracing, and manage evaluation artifacts.

Installation

pip install graded

Or using uv:

uv pip install graded

Quick Start

Create an evaluation script (e.g. verify.py) to grade a task workspace:

from pathlib import Path
from graded import Evaluator

# Initialize the evaluator
ev = Evaluator(
    workspace="/workspace",
    output_path="/logs/verifier/reward.json",
    auto_save_artifacts=True
)

# 1. Declare a standard criterion
@ev.criterion(name="has_output_file", weight=1.0)
def check_output(workspace: Path) -> bool:
    return ev.file_exists("output.txt")

# 2. Declare a fatal criterion (short-circuits final score to 0.0 if failed)
@ev.criterion(name="no_syntax_errors", weight=2.0, fatal=True)
def check_syntax(workspace: Path) -> bool:
    # Placeholder: replace with a real syntax check for the task's files
    return True

# 3. Declare a fractional scoring criterion
@ev.criterion(name="test_pass_rate", weight=3.0)
def check_tests(workspace: Path) -> float:
    return 0.8  # Returns a score between 0.0 and 1.0

if __name__ == "__main__":
    ev.run()

Core Features

1. Criteria Declarations (@ev.criterion)

Define check functions using the @ev.criterion decorator.

  • name: Unique identifier for the criterion.
  • weight: Relative weight of the score in the final weighted average calculation.
  • fatal: If True, a result of False or 0.0 from this criterion immediately short-circuits the final score to 0.0.
  • Return Value: Must return a bool, int, or float.
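
The weighting and fatal rules above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the combine helper is hypothetical, and it assumes that booleans map to 1.0 and 0.0 and that the final score is the sum of weight × score divided by the sum of weights.

```python
from typing import Union

Score = Union[bool, int, float]


def combine(results: list[tuple[Score, float, bool]]) -> float:
    """Combine (score, weight, fatal) triples into one reward.

    Hypothetical helper for illustration only; it is not part of graded.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for score, weight, fatal in results:
        value = float(score)  # True -> 1.0, False -> 0.0
        if fatal and value == 0.0:
            # A failed fatal criterion overrides every other score.
            return 0.0
        weighted_sum += weight * value
        total_weight += weight
    # Assumption: with no criteria at all, report 0.0 rather than dividing by zero.
    return weighted_sum / total_weight if total_weight else 0.0


# The three Quick Start criteria: (score, weight, fatal)
quick_start = [(True, 1.0, False), (True, 2.0, True), (0.8, 3.0, False)]
reward = combine(quick_start)
print(round(reward, 3))  # 0.9
```

Under these assumptions the Quick Start script would score (1.0 × 1 + 2.0 × 1 + 3.0 × 0.8) / 6.0 = 0.9, and flipping no_syntax_errors to False would drop the result to 0.0 regardless of the other criteria.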

2. LLM Judge with Automatic Tracing

Graded integrates with instructor to run structured, schema-validated LLM grading prompts. The prompt, parameters, response schema, and LLM response for each call are automatically logged to traces.json.

from pydantic import BaseModel, Field

class Rubric(BaseModel):
    score: float = Field(description="Score between 0.0 and 1.0 based on correctness.")
    reasoning: str = Field(description="Detailed reasoning for the score.")

# In your criterion:
result = ev.llm_judge(
    model="google/gemini-3.5-flash",
    response_model=Rubric,
    system="You are a strict code correctness evaluator.",
    prompt="Compare the student's solution in code.py with the requirements...",
)

# The return value is fully type-hinted as an instance of your Rubric class
print(result.score)
print(result.reasoning)
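
A Field description only tells the judge which range is intended; it does not enforce one. A criterion that returns the judge's score may therefore want to clamp it first. The sketch below uses only the standard library: RubricStandIn and clamp_score are hypothetical names standing in for the Rubric model above, and nothing here is part of graded. Pydantic's own ge and le constraints on Field are an alternative if out-of-range scores should fail validation instead.

```python
import math
from dataclasses import dataclass


@dataclass
class RubricStandIn:
    """Hypothetical stand-in for the Rubric model; not part of graded."""

    score: float
    reasoning: str


def clamp_score(raw: float) -> float:
    """Map a judge score into [0.0, 1.0]; treat NaN as 0.0."""
    if math.isnan(raw):
        return 0.0
    return max(0.0, min(1.0, raw))


# Simulated judge output that overshoots the intended range
result = RubricStandIn(score=1.3, reasoning="Overly generous judge.")
print(clamp_score(result.score))  # 1.0
```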

3. File & Artifact Management

Access files and copy evaluation artifacts to the logs directory safely:

  • ev.read_file(filename): Reads content as a string and auto-saves a copy to artifacts.
  • ev.load_json(filename): Parses JSON file content and auto-saves a copy to artifacts.
  • ev.save_file(filename, content): Saves arbitrary text to the artifacts directory.
  • ev.save_dir(dirname): Copies an entire directory from the workspace to the artifacts directory.
  • ev.load_trajectory(path): Loads and parses an agent's ATIF trajectory.json file.
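
To make "reads content and auto-saves a copy" concrete, here is a standard-library sketch of that pattern. It is an assumption about the behaviour, not the library's code: read_and_archive is a hypothetical function, and the check that rejects paths outside the workspace is one possible reading of "safely".

```python
import shutil
import tempfile
from pathlib import Path


def read_and_archive(workspace: Path, artifacts: Path, filename: str) -> str:
    """Read a workspace file and keep a copy under the artifacts directory.

    Hypothetical helper for illustration only; it is not part of graded.
    """
    root = workspace.resolve()
    source = (root / filename).resolve()
    # Assumption: refuse paths such as "../secret" that escape the workspace.
    if not source.is_relative_to(root):
        raise ValueError(f"{filename!r} is outside the workspace")
    content = source.read_text(encoding="utf-8")
    # Mirror the file's relative location inside the artifacts directory.
    target = artifacts / source.relative_to(root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return content


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    artifacts = Path(tmp) / "artifacts"
    workspace.mkdir()
    (workspace / "output.txt").write_text("done\n", encoding="utf-8")

    text = read_and_archive(workspace, artifacts, "output.txt")
    archived = (artifacts / "output.txt").read_text(encoding="utf-8")
    print(text == archived)  # True
```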

Outputs

When ev.run() completes, the following files are written to the directory containing your configured output_path:

  1. reward.json: Flat JSON dictionary containing the final calculated reward and individual scores.
  2. reward.txt: Text file containing just the final reward float value.
  3. traces.json: List of structured LLM calls made via ev.llm_judge.
  4. metadata.json: Optional metadata.
  5. artifacts/: Subfolder containing copy-back files preserved during the evaluation run.
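
A downstream script might read these files back as follows. This is a sketch under stated assumptions: load_outputs is a hypothetical helper, traces.json is treated as optional, and the key names written to the demo reward.json ("reward", "has_output_file") are invented for the example, since the exact schema is not specified here.

```python
import json
import tempfile
from pathlib import Path


def load_outputs(output_dir: Path) -> dict:
    """Load reward.json, reward.txt, and (if present) traces.json.

    Hypothetical helper for illustration only; it is not part of graded.
    """
    scores = json.loads((output_dir / "reward.json").read_text(encoding="utf-8"))
    # reward.txt holds just the final reward, so it parses directly as a float.
    reward = float((output_dir / "reward.txt").read_text(encoding="utf-8").strip())
    traces_path = output_dir / "traces.json"
    traces = (
        json.loads(traces_path.read_text(encoding="utf-8"))
        if traces_path.exists()
        else []
    )
    return {"scores": scores, "reward": reward, "traces": traces}


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Demo files with assumed key names; real runs are written by ev.run().
    (out / "reward.json").write_text(
        json.dumps({"reward": 0.9, "has_output_file": 1.0}), encoding="utf-8"
    )
    (out / "reward.txt").write_text("0.9\n", encoding="utf-8")
    outputs = load_outputs(out)

print(outputs["reward"])  # 0.9
```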

Agent Skills

You can install the graded-verifier skill to teach your AI coding agents (such as Cursor or Claude Code) how to write robust graded verifiers:

npx skills add <github-username>/eval-helpers/.agents/skills/graded-verifier

Download files

Source Distribution

graded-1.0.4.tar.gz (11.6 kB)

Uploaded Source

Built Distribution

graded-1.0.4-py3-none-any.whl (8.3 kB)

Uploaded Python 3

File details

Details for the file graded-1.0.4.tar.gz.

File metadata

  • Download URL: graded-1.0.4.tar.gz
  • Upload date:
  • Size: 11.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for graded-1.0.4.tar.gz
Algorithm Hash digest
SHA256 e71d76c238324f1ae1fcbbe1cf9b3d387434a17c902972dc715d9720f71fd278
MD5 d33bb20679a124ccd2d2ad2f5856e109
BLAKE2b-256 0d2f12cd48431e7035b2b6c08ee0975e2b71f938389d8a3dccabd27bc64c10c0

Provenance

The following attestation bundles were made for graded-1.0.4.tar.gz:

Publisher: ci.yml on ivanleomk/graded

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file graded-1.0.4-py3-none-any.whl.

File metadata

  • Download URL: graded-1.0.4-py3-none-any.whl
  • Upload date:
  • Size: 8.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for graded-1.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 d2336ec3b5e7ff0720979caf484a57e10062e7782b22d1dbde769b868f8b0937
MD5 4ea499c720c26a5f41df751ab3413b92
BLAKE2b-256 228d52ded943c52b8e2bce410edd20d104e2b5eaeb8ae8a01e13ed83fa594359

Provenance

The following attestation bundles were made for graded-1.0.4-py3-none-any.whl:

Publisher: ci.yml on ivanleomk/graded

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
