Defensive verifier framework and helpers for Harbor evaluations
graded 🍳
graded is a defensive verifier and grading framework for agent evaluations, particularly Harbor agent evaluations. It lets you declare structured grading criteria, call LLM judges with automatic tracing, and safely manage evaluation artifacts.
Installation
Install graded directly from PyPI (or your internal registry):

```bash
pip install graded
```

Or with uv:

```bash
uv pip install graded
```
Quick Start
Create an evaluation script (e.g. verify.py) to grade a task workspace:
```python
from pathlib import Path

from graded import Evaluator

# Initialize the evaluator
ev = Evaluator(
    workspace="/workspace",
    output_path="/logs/verifier/reward.json",
    auto_save_artifacts=True,
)


# 1. Declare a standard criterion
@ev.criterion(name="has_output_file", weight=1.0)
def check_output(workspace: Path) -> bool:
    return (workspace / "output.txt").is_file()


# 2. Declare a fatal criterion (short-circuits score to 0.0 if failed)
@ev.criterion(name="no_syntax_errors", weight=2.0, fatal=True)
def check_syntax(workspace: Path) -> bool:
    # Return True or False (or a float between 0.0 and 1.0)
    return True


# 3. Declare a fractional scoring criterion
@ev.criterion(name="test_pass_rate", weight=3.0)
def check_tests(workspace: Path) -> float:
    # Returns a score between 0.0 and 1.0
    return 0.8  # e.g., 80% of tests passed


# Run the evaluation and write outputs
if __name__ == "__main__":
    ev.run()
```
Core Features
1. Criteria Declarations (@ev.criterion)
Define check functions using the @ev.criterion decorator.
- name: The unique identifier for the criterion.
- weight: Relative weight of the score in the final weighted average calculation.
- fatal: If set to True, any score of 0.0 or False immediately short-circuits the final score to 0.0.
- Return Value: Must return a bool, int, or float. Anything else raises a ValueError.
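The scoring rules above can be sketched in plain Python. This is not graded's implementation: Result, to_score, and aggregate are hypothetical names, and the sketch assumes the final score is a plain weighted mean with no clamping, which the library may or may not do.

```python
from typing import Iterable, NamedTuple


class Result(NamedTuple):
    """One criterion outcome (hypothetical container, not a graded type)."""

    name: str
    weight: float
    fatal: bool
    value: object  # whatever the check function returned


def to_score(value: object) -> float:
    """Coerce a check's return value to a float, rejecting other types."""
    if isinstance(value, (bool, int, float)):
        return float(value)
    raise ValueError(
        f"criterion returned {type(value).__name__}, expected bool, int, or float"
    )


def aggregate(results: Iterable[Result]) -> float:
    """Weighted mean of the scores; a failed fatal criterion forces 0.0."""
    weighted_sum = 0.0
    total_weight = 0.0
    for result in results:
        score = to_score(result.value)
        if result.fatal and score == 0.0:
            return 0.0  # fatal failure short-circuits the final score
        weighted_sum += result.weight * score
        total_weight += result.weight
    # Assumption: with no criteria (or zero total weight) the score is 0.0.
    return weighted_sum / total_weight if total_weight else 0.0


# The three Quick Start criteria, with the values their checks return.
quick_start = [
    Result("has_output_file", 1.0, False, True),
    Result("no_syntax_errors", 2.0, True, True),
    Result("test_pass_rate", 3.0, False, 0.8),
]
print(round(aggregate(quick_start), 4))
```

Because bool is accepted, True and False count as 1.0 and 0.0, which is why a fatal criterion returning False is treated the same as one returning 0.0.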
2. LLM Judge with Automatic Tracing
graded integrates with instructor to run structured, schema-validated LLM grading prompts, automatically logging prompt, parameters, response schema, and LLM responses to traces.json.
```python
from pydantic import BaseModel, Field


class Rubric(BaseModel):
    score: float = Field(description="Score between 0.0 and 1.0 based on correctness.")
    reasoning: str = Field(description="Detailed reasoning for the score.")


# In your criterion:
result = ev.llm_judge(
    model="google/gemini-3.5-flash",
    response_model=Rubric,
    system="You are a strict code correctness evaluator.",
    prompt="Compare the student's solution in code.py with the requirements...",
)
print(f"LLM Score: {result.score}")
print(f"Reasoning: {result.reasoning}")
```
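To give a feel for what automatic tracing involves, here is a minimal standard-library sketch that records each judge call's inputs, response, and latency, then writes the records as a JSON list. TraceLog, traced_call, fake_judge, and the record keys are all assumptions made for illustration; the actual layout of traces.json is not specified here and may differ.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


class TraceLog:
    """Collects one record per judge call and writes them out as a JSON list."""

    def __init__(self) -> None:
        self.records: list[dict[str, Any]] = []

    def traced_call(
        self, judge: Callable[..., dict[str, Any]], **params: Any
    ) -> dict[str, Any]:
        """Call judge(**params) and record its inputs, response, and latency."""
        start = time.perf_counter()
        response = judge(**params)
        self.records.append(
            {
                "params": params,  # hypothetical key names
                "response": response,
                "latency_s": round(time.perf_counter() - start, 6),
            }
        )
        return response

    def save(self, path: Path) -> None:
        """Write all records collected so far to path as indented JSON."""
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(self.records, indent=2), encoding="utf-8")


def fake_judge(model: str, system: str, prompt: str) -> dict[str, Any]:
    """Stand-in for a real LLM call; always returns the same rubric."""
    return {"score": 0.5, "reasoning": "placeholder"}


log = TraceLog()
result = log.traced_call(
    fake_judge, model="example-model", system="Be strict.", prompt="Grade this."
)
print(result["score"], len(log.records))
```

Keeping the record append inside the wrapper means a call cannot be made without being traced, which is the property the feature description emphasises.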
3. File & Artifact Management
Safely access files and copy evaluation artifacts to the logs directory for post-evaluation review.
- ev.read_file(filename): Safely reads file content as a string. Auto-saves a copy to artifacts.
- ev.load_json(filename): Safely parses JSON file content. Auto-saves a copy to artifacts.
- ev.save_file(filename, content): Saves arbitrary text/data to the artifacts directory.
- ev.save_dir(dirname): Copies an entire directory from the workspace to the artifacts directory.
- ev.load_trajectory(path): Loads and parses an agent's ATIF trajectory.json file into a typed Trajectory object.
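As a rough illustration of the read-and-preserve pattern, the sketch below reads a file from a workspace and keeps a copy in an artifacts directory. read_and_archive is a hypothetical helper, not part of graded, and the path-escape check is one assumed interpretation of "safely"; the library's own safeguards are not documented here.

```python
import shutil
import tempfile
from pathlib import Path


def read_and_archive(workspace: Path, artifacts: Path, filename: str) -> str:
    """Read workspace/filename as text and keep a copy under artifacts/."""
    root = workspace.resolve()
    source = (root / filename).resolve()
    # Refuse paths such as "../secret" that resolve outside the workspace.
    if not source.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{filename!r} escapes the workspace")
    target = artifacts / source.relative_to(root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)  # preserve the file for later review
    return source.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    artifacts = Path(tmp) / "artifacts"
    workspace.mkdir()
    (workspace / "output.txt").write_text("done\n", encoding="utf-8")

    content = read_and_archive(workspace, artifacts, "output.txt")
    archived = (artifacts / "output.txt").read_text(encoding="utf-8")
    print(content == archived)
```

Copying at read time means the artifacts directory ends up holding exactly the files the verifier looked at.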
Outputs
When ev.run() completes, the following files are written to the directory containing your configured output_path:
- reward.json: A flat JSON dictionary containing the final calculated reward and the individual scores for each criterion:

  ```json
  { "reward": 0.75, "has_output_file": 1.0, "no_syntax_errors": 1.0, "test_pass_rate": 0.8 }
  ```

- reward.txt: A text file containing just the final reward float value (e.g. 0.7500\n).
- traces.json: A list of structured LLM calls made via ev.llm_judge, detailing inputs, responses, latencies, and metadata.
- metadata.json: (Optional) Contains evaluator-level and run-level metadata.
- artifacts/: Subfolder containing copy-back files preserved during the evaluation run.
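The two reward files can be reproduced with a few lines of standard-library Python, which is handy when writing tests for downstream tooling that consumes them. write_outputs is a hypothetical helper. The flat layout and four-decimal text format follow the description above, while the JSON indentation and key order are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_outputs(output_path: Path, reward: float, scores: dict[str, float]) -> None:
    """Write reward.json and a sibling reward.txt in the layout described above."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Flat dictionary: the final reward first, then one key per criterion.
    # Note that a criterion named "reward" would overwrite the final reward here.
    payload = {"reward": reward, **scores}
    output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    # Plain-text reward with four decimals and a trailing newline.
    (output_path.parent / "reward.txt").write_text(f"{reward:.4f}\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "verifier" / "reward.json"
    write_outputs(
        out,
        0.75,
        {"has_output_file": 1.0, "no_syntax_errors": 1.0, "test_pass_rate": 0.8},
    )
    reward_json = json.loads(out.read_text(encoding="utf-8"))
    reward_txt = (out.parent / "reward.txt").read_text(encoding="utf-8")
    print(reward_json["reward"], repr(reward_txt))
```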
File details
Details for the file graded-1.0.1.tar.gz.
File metadata
- Download URL: graded-1.0.1.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ea6dff02f9d75153de32f104a92dfe9011013d0cafac3c06ea7e7f94a867cd90 |
| MD5 | e29d0ecacc7a4b02fe24355f9623b6e7 |
| BLAKE2b-256 | 4ef20a88faa02442688d6c1b76f648d38bb11630ca3f8884e2b73674034c78ca |
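If you download the source distribution manually, you can check it against the SHA256 digest listed above with Python's hashlib. This is a minimal sketch: sha256_of and matches are hypothetical helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for graded-1.0.1.tar.gz.
EXPECTED_SDIST_SHA256 = "ea6dff02f9d75153de32f104a92dfe9011013d0cafac3c06ea7e7f94a867cd90"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


sdist = Path("graded-1.0.1.tar.gz")  # assumed download location
if sdist.exists():
    print(matches(sdist, EXPECTED_SDIST_SHA256))
```

pip can perform the same check automatically when a requirements file pins hashes and is installed with --require-hashes.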
Provenance
The following attestation bundles were made for graded-1.0.1.tar.gz:
Publisher: ci.yml on ivanleomk/graded
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: graded-1.0.1.tar.gz
- Subject digest: ea6dff02f9d75153de32f104a92dfe9011013d0cafac3c06ea7e7f94a867cd90
- Sigstore transparency entry: 1787181816
- Sigstore integration time:
- Permalink: ivanleomk/graded@d87c4c0cc7f73e5c7d2028b03f06e8654307240f
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/ivanleomk
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@d87c4c0cc7f73e5c7d2028b03f06e8654307240f
- Trigger Event: push
File details
Details for the file graded-1.0.1-py3-none-any.whl.
File metadata
- Download URL: graded-1.0.1-py3-none-any.whl
- Upload date:
- Size: 8.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 269af787cccdbe691e455b46320417b3887e226922f84c844f50a442311fbf8d |
| MD5 | 1ac347412ccd4b3587d81c296a86b856 |
| BLAKE2b-256 | adfad2e26465e3a7b51dec552b47cadd2757577a49cc938f7318227677de3a5b |
Provenance
The following attestation bundles were made for graded-1.0.1-py3-none-any.whl:
Publisher: ci.yml on ivanleomk/graded
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: graded-1.0.1-py3-none-any.whl
- Subject digest: 269af787cccdbe691e455b46320417b3887e226922f84c844f50a442311fbf8d
- Sigstore transparency entry: 1787181882
- Sigstore integration time:
- Permalink: ivanleomk/graded@d87c4c0cc7f73e5c7d2028b03f06e8654307240f
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/ivanleomk
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@d87c4c0cc7f73e5c7d2028b03f06e8654307240f
- Trigger Event: push