Defensive verifier framework and helpers for Harbor evaluations
Graded
Graded is a library to make computing rewards simple, defensive, and structured for agent evaluations, particularly within Harbor environments. It provides tools to declare structured grading criteria, execute LLM judges with automatic tracing, and manage evaluation artifacts.
Installation
pip install graded
Or using uv:
uv pip install graded
Quick Start
Create an evaluation script (e.g. verify.py) to grade a task workspace:
from pathlib import Path

from graded import Evaluator

# Initialize the evaluator
ev = Evaluator(
    workspace="/workspace",
    output_path="/logs/verifier/reward.json",
    auto_save_artifacts=True
)

# 1. Declare a standard criterion (workspace parameter is optional)
@ev.criterion(name="has_output_file", weight=1.0)
def check_output() -> bool:
    return ev.file_exists("output.txt")

# 2. Declare a fatal criterion (takes workspace Path parameter to inspect files directly)
@ev.criterion(name="no_syntax_errors", weight=2.0, fatal=True)
def check_syntax(workspace: Path) -> bool:
    # Use workspace parameter to inspect the files on disk
    return (workspace / "src").is_dir()

# 3. Declare a fractional scoring criterion
@ev.criterion(name="test_pass_rate", weight=3.0)
def check_tests() -> float:
    return 0.8  # Returns a score between 0.0 and 1.0

if __name__ == "__main__":
    ev.run()
Core Features
1. Criteria Declarations (@ev.criterion)
Define check functions using the @ev.criterion decorator.
Check functions can optionally accept the workspace directory as a pathlib.Path parameter if they need to perform custom filesystem operations. If a function does not accept any arguments, it will be executed without the workspace parameter.
- name: Unique identifier for the criterion.
- weight: Relative weight of the score in the final weighted average calculation.
- fatal: If True, any score of 0.0 or False immediately short-circuits the final score to 0.0.
- Return Value: Must return a bool, int, or float.
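The weighting and fatal rules above can be illustrated with a small standalone sketch. It is not graded's internal implementation: the `CriterionResult` class and `aggregate` function are hypothetical names, and the sketch assumes that booleans count as 1.0 or 0.0 and that the final reward is the weight-normalized mean of the scores.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """Hypothetical record of one evaluated criterion (not a graded type)."""
    name: str
    score: float  # bool results are assumed to map to 1.0 / 0.0
    weight: float = 1.0
    fatal: bool = False


def aggregate(results: list[CriterionResult]) -> float:
    """Weighted average of scores, with a fatal short-circuit to 0.0."""
    # A fatal criterion that scored 0.0 (or False) zeroes the whole reward.
    if any(r.fatal and float(r.score) == 0.0 for r in results):
        return 0.0
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0  # assumption: no weighted criteria means no reward
    return sum(float(r.score) * r.weight for r in results) / total_weight


# The three Quick Start criteria, assuming the first two pass.
quick_start = [
    CriterionResult("has_output_file", True, weight=1.0),
    CriterionResult("no_syntax_errors", True, weight=2.0, fatal=True),
    CriterionResult("test_pass_rate", 0.8, weight=3.0),
]
reward = aggregate(quick_start)  # (1*1.0 + 2*1.0 + 3*0.8) / 6
```

Under these assumptions the Quick Start criteria come out at roughly 0.9, while a failing no_syntax_errors check would force the reward to 0.0 regardless of the other scores.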
Programmatic Registration (Without Decorators)
If you prefer to define functions normally, you can register them programmatically without using decorator syntax:
def check_output(workspace: Path) -> bool:
    return (workspace / "output.txt").is_file()

# Register directly
ev.criterion("has_output_file", weight=1.0)(check_output)
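Direct registration works because a decorator that takes arguments is just a function returning another function. The sketch below is a hypothetical stand-in (the `MiniRegistry` class is not part of graded) that shows this equivalence, along with one way a framework could decide whether to pass the workspace: counting the check's parameters with `inspect.signature`.

```python
import inspect
from pathlib import Path
from typing import Callable, Union

Score = Union[bool, int, float]


class MiniRegistry:
    """Hypothetical stand-in for an evaluator; not graded's implementation."""

    def __init__(self, workspace: str) -> None:
        self.workspace = Path(workspace)
        self.checks: dict[str, tuple[Callable[..., Score], float]] = {}

    def criterion(self, name: str, weight: float = 1.0):
        # Returns the real decorator, so `@reg.criterion(...)` and
        # `reg.criterion(...)(func)` do exactly the same thing.
        def register(func: Callable[..., Score]) -> Callable[..., Score]:
            self.checks[name] = (func, weight)
            return func
        return register

    def call(self, name: str) -> Score:
        func, _weight = self.checks[name]
        # Pass the workspace only if the check declares a parameter.
        if inspect.signature(func).parameters:
            return func(self.workspace)
        return func()


reg = MiniRegistry("/workspace")


@reg.criterion("always_passes")
def always_passes() -> bool:
    return True


def check_workspace_name(workspace: Path) -> bool:
    return workspace.name == "workspace"


# Same registration path as the decorator form above.
reg.criterion("workspace_name", weight=2.0)(check_workspace_name)
```

Both checks end up in the same registry, and `reg.call` invokes each with the arguments it declares.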
2. LLM Judge with Automatic Tracing
Integrate with instructor to run structured, schema-validated LLM grading prompts. Prompt, parameters, response schema, and LLM responses are automatically logged to traces.json.
from pydantic import BaseModel, Field

class Rubric(BaseModel):
    score: float = Field(description="Score between 0.0 and 1.0 based on correctness.")
    reasoning: str = Field(description="Detailed reasoning for the score.")

# In your criterion:
result = ev.llm_judge(
    model="google/gemini-3.5-flash",
    response_model=Rubric,
    system="You are a strict code correctness evaluator.",
    prompt="Compare the student's solution in code.py with the requirements...",
)

# The return value is fully type-hinted as an instance of your Rubric class
print(result.score)
print(result.reasoning)
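A field description alone does not force the model to stay within range, so a criterion may want to sanitize the judge's score before returning it. The helper below is a sketch of one defensive approach; `to_reward` is a hypothetical name, not part of graded, and treating non-finite values as 0.0 is an assumption about what a verifier would want.

```python
import math


def to_reward(score: float) -> float:
    """Clamp a judge score into [0.0, 1.0]; non-finite values become 0.0."""
    value = float(score)
    if not math.isfinite(value):
        return 0.0  # assumption: NaN or infinity is treated as a failed check
    return min(1.0, max(0.0, value))
```

Inside a criterion this would wrap the judge output, for example `return to_reward(result.score)`.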
3. File & Artifact Management
Access files and copy evaluation artifacts to the logs directory safely:
- ev.read_file(filename): Reads content as a string and auto-saves a copy to artifacts.
- ev.load_json(filename): Parses JSON file content and auto-saves a copy to artifacts.
- ev.save_file(filename, content): Saves arbitrary text to the artifacts directory.
- ev.save_dir(dirname): Copies an entire directory from the workspace to the artifacts directory.
- ev.load_trajectory(path): Loads and parses an agent's ATIF trajectory.json file.
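The read-and-preserve behavior described above can be pictured with the standard library alone. This is a sketch under assumptions, not graded's code: `read_and_archive` is a hypothetical function, the artifact copy is assumed to keep the file's path relative to the workspace, and the path check is one possible way to reject reads outside the workspace.

```python
import shutil
import tempfile
from pathlib import Path


def read_and_archive(workspace: Path, artifacts: Path, filename: str) -> str:
    """Read a workspace file as text and keep a copy under `artifacts`."""
    root = workspace.resolve()
    source = (root / filename).resolve()
    # Refuse paths such as "../secret" that escape the workspace.
    if not source.is_relative_to(root):
        raise ValueError(f"{filename!r} is outside the workspace")
    content = source.read_text(encoding="utf-8")
    target = artifacts / source.relative_to(root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return content


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    artifacts = Path(tmp) / "artifacts"
    workspace.mkdir()
    (workspace / "output.txt").write_text("done\n", encoding="utf-8")
    text = read_and_archive(workspace, artifacts, "output.txt")
    archived = (artifacts / "output.txt").read_text(encoding="utf-8")
```

Copying at read time means the artifacts directory records exactly what the verifier saw, even if the workspace is later discarded.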
Outputs
When ev.run() completes, the following files are written to the directory containing your configured output_path:
- reward.json: Flat JSON dictionary containing the final calculated reward and individual scores.
- reward.txt: Text file containing just the final reward float value.
- traces.json: List of structured LLM calls made via ev.llm_judge.
- metadata.json: Optional metadata.
- artifacts/: Subfolder containing copy-back files preserved during the evaluation run.
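To make the two reward files concrete, here is a sketch of writing them by hand. It is an illustration only: `write_reward_files` is a hypothetical helper, and the exact key layout of reward.json (a top-level `reward` key alongside one key per criterion) is an assumption based on the description above, not a documented schema.

```python
import json
import tempfile
from pathlib import Path


def write_reward_files(output_path: Path, reward: float, scores: dict[str, float]) -> None:
    """Write reward.json plus a sibling reward.txt next to `output_path`."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumed layout: one flat dictionary, final reward plus per-criterion scores.
    payload = {"reward": reward, **scores}
    output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    (output_path.parent / "reward.txt").write_text(str(reward), encoding="utf-8")


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "logs" / "verifier" / "reward.json"
    write_reward_files(out, 0.9, {"has_output_file": 1.0, "test_pass_rate": 0.8})
    saved = json.loads(out.read_text(encoding="utf-8"))
    reward_text = (out.parent / "reward.txt").read_text(encoding="utf-8")
```

Inspect the files produced by an actual ev.run() before relying on specific key names.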
Agent Skills
You can install the graded-verifier skill to teach your AI coding agents (such as Cursor or Claude Code) how to write robust graded verifiers:
npx skills add <github-username>/eval-helpers/.agents/skills/graded-verifier