Inspect Harbor
Inspect AI interface to Harbor tasks
This package provides an interface to run Harbor tasks using Inspect AI.
Installation
Install from PyPI:
pip install inspect-harbor
Or with uv:
uv add inspect-harbor
For development installation, see the Development section.
Prerequisites
Before running Harbor tasks, ensure you have:
- Python 3.12 or higher - Required by inspect_harbor
- Docker installed and running - Required for execution when using Docker sandbox (default)
- Model API keys - Set appropriate environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY)
Quick Start
The fastest way to get started is to run a dataset from the Harbor registry.
Evaluate with a Model
CLI:
# Run hello-world dataset
inspect eval inspect_harbor/hello_world \
--model openai/gpt-5-mini
# Run terminal-bench-sample dataset
inspect eval inspect_harbor/terminal_bench_sample \
--model openai/gpt-5
Python API:
from inspect_ai import eval
from inspect_harbor import hello_world, terminal_bench_sample
# Run hello-world
eval(hello_world(), model="openai/gpt-5-mini")
# Run terminal-bench-sample
eval(terminal_bench_sample(), model="openai/gpt-5")
What this does:
- Loads the dataset from the Harbor registry
- Downloads and caches all tasks in the dataset
- Solves the tasks with the default ReAct agent scaffold
- Executes in a Docker sandbox environment
- Stores results in ./logs
Available Datasets
Inspect Harbor provides task functions for each dataset in the Harbor registry. You can import and use them directly:
from inspect_harbor import (
terminal_bench,
swebenchpro,
swe_lancer_diamond,
swebench_verified,
# ... and many more
)
For a complete list of available datasets and versions (including swebenchpro, terminal-bench-pro, replicationbench, compilebench, and 40+ more), see REGISTRY.md.
Dataset Versioning
Each dataset has both unversioned and versioned task functions:
- Unversioned functions (e.g., terminal_bench()) automatically use the latest version available in the registry
- Versioned functions (e.g., terminal_bench_2_0()) pin to a specific version for reproducibility
Example:
from inspect_harbor import terminal_bench, terminal_bench_2_0
# Uses latest version (currently 2.0)
eval(terminal_bench(), model="openai/gpt-5-mini")
# Pins to version 2.0 explicitly
eval(terminal_bench_2_0(), model="openai/gpt-5-mini")
Agents and Solvers
Solvers are the execution components in Inspect AI. They can run agent scaffolds (like ReAct), execute solution scripts (like the Oracle solver), perform prompt engineering, and more. Both solvers and agents can be used to solve Harbor tasks.
Default Agent Scaffold
When no agent or solver is specified, Inspect Harbor provides a default agent scaffold for your model:
- Agent Type: ReAct agent
- Tools: bash(timeout=300), python(timeout=300), update_plan()
- Compaction: CompactionEdit() for context window management
This default configuration is suitable for most Harbor tasks that require command execution and file manipulation.
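The ReAct pattern behind the default scaffold can be sketched conceptually. Everything below (the toy react_loop, the scripted model, the stand-in tools) is illustrative only, not the inspect_ai implementation:

```python
# Conceptual sketch of a ReAct loop: the model alternates between
# proposing an action and observing its result until it submits.
# Tool names mirror the defaults above; the rest is a toy.

def react_loop(model, tools, task, max_steps=10):
    """Run propose -> act -> observe until the model submits an answer."""
    transcript = [task]
    for _ in range(max_steps):
        step = model(transcript)  # model proposes the next step
        transcript.append(step)
        if step["action"] == "submit":
            return step["input"], transcript
        # execute the chosen tool and record the observation
        output = tools[step["action"]](step["input"])
        transcript.append({"observation": output})
    return None, transcript

# Toy tools standing in for bash()/python()
tools = {
    "bash": lambda cmd: f"ran: {cmd}",
    "python": lambda code: f"evaluated: {code}",
}

# Scripted "model" that inspects a file, then submits
script = iter([
    {"action": "bash", "input": "cat task.txt"},
    {"action": "submit", "input": "done"},
])
answer, transcript = react_loop(lambda t: next(script), tools, "solve the task")
```

The real agent additionally manages context compaction and tool timeouts; this sketch only shows the control flow.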
Using Custom Agents
You can provide your own agent or solver implementation using the --solver flag:
Using a custom agent:
inspect eval inspect_harbor/terminal_bench \
--solver path/to/custom/agent.py@custom_agent \
--model openai/gpt-5
Using Inspect SWE agent framework:
First install the required package:
pip install inspect-swe
CLI:
inspect eval inspect_harbor/terminal_bench_sample \
--solver inspect_swe/claude_code \
--model anthropic/claude-sonnet-4-5
Python API:
from inspect_ai import eval
from inspect_harbor import terminal_bench_sample
from inspect_swe import claude_code
eval(
terminal_bench_sample(),
solver=claude_code(),
model="anthropic/claude-sonnet-4-5"
)
For more details, see the Inspect SWE documentation.
Task Parameters
Task functions (like terminal_bench(), swe_lancer_diamond(), etc.) accept the following parameters:
| Parameter | Description | Default | Python Example | CLI Example |
|---|---|---|---|---|
| dataset_task_names | List of task names to include (supports glob patterns) | None | ["aime_60", "aime_61"] | '["aime_60"]' |
| dataset_exclude_task_names | List of task names to exclude (supports glob patterns) | None | ["aime_60"] | '["aime_60"]' |
| n_tasks | Maximum number of tasks to run | None | 10 | 10 |
| overwrite_cache | Force re-download and overwrite cached tasks | False | True | true |
| sandbox_env_name | Sandbox environment name | "docker" | "modal" | "modal" |
| override_cpus | Override the number of CPUs from task.toml | None | 4 | 4 |
| override_memory_mb | Override the memory (in MB) from task.toml | None | 16384 | 16384 |
| override_gpus | Override the number of GPUs from task.toml | None | 1 | 1 |
Example
Here's an example showing how to use multiple parameters together:
CLI:
inspect eval inspect_harbor/terminal_bench_sample \
-T n_tasks=5 \
-T overwrite_cache=true \
-T override_memory_mb=8192 \
--model anthropic/claude-sonnet-4-5
Python API:
from inspect_ai import eval
from inspect_harbor import terminal_bench_sample
eval(
terminal_bench_sample(
n_tasks=5,
overwrite_cache=True,
override_memory_mb=8192,
),
model="anthropic/claude-sonnet-4-5"
)
This example:
- Limits to 5 tasks using n_tasks
- Forces fresh download with overwrite_cache
- Allocates 8GB of memory with override_memory_mb
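Since dataset_task_names and dataset_exclude_task_names accept glob patterns, the selection logic can be illustrated with Python's fnmatch. This is a plausible sketch of how the filters combine, not the package's actual code:

```python
from fnmatch import fnmatch

def filter_tasks(names, include=None, exclude=None, n_tasks=None):
    """Sketch of how dataset_task_names / dataset_exclude_task_names /
    n_tasks might combine: include globs, minus exclude globs, then cap."""
    selected = [
        n for n in names
        if (include is None or any(fnmatch(n, p) for p in include))
        and not (exclude and any(fnmatch(n, p) for p in exclude))
    ]
    return selected[:n_tasks] if n_tasks else selected

names = ["aime_60", "aime_61", "aime_70", "gpqa_1"]
print(filter_tasks(names, include=["aime_*"], exclude=["aime_7*"]))
# → ['aime_60', 'aime_61']
```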
Understanding Harbor Tasks
What is a Harbor Task?
Harbor is a framework for building, evaluating, and optimizing agents and models in containerized environments. A Harbor task is a self-contained evaluation unit that includes an instruction, execution environment, scoring criteria, and optionally a reference solution.
For comprehensive details about Harbor tasks, see the Harbor documentation.
Harbor Task File Structure
A typical Harbor task directory contains the following components:
my_task/
├── instruction.md # Task instructions/prompt shown to the agent
├── task.toml # Metadata, timeouts, resource specs (CPU/memory/GPU), env vars
├── environment/ # Environment setup - Dockerfile or docker-compose.yaml
│ └── Dockerfile # Docker environment spec (varies by sandbox provider)
├── solution/ # (Optional) Reference solution for sanity checking
│ ├── solve.sh # Executable solution script used by Oracle solver
│ └── ... # Supporting solution files and dependencies
└── tests/ # Verification and scoring
├── test.sh # Test script executed by verifier
└── ... # Outputs reward.txt or reward.json to /logs/verifier/
Harbor to Inspect Mapping
Inspect Harbor bridges Harbor tasks to the Inspect AI evaluation framework using the following mappings:
| Harbor Concept | Inspect Concept | Description |
|---|---|---|
| Harbor Task | Sample | A single evaluation instance with instructions and environment |
| Harbor Dataset | Task | A collection of related evaluation instances |
| instruction.md | Sample.input | The prompt/instructions given to the agent |
| environment/ | SandboxEnvironmentSpec | Docker/environment configuration for isolated execution |
| tests/test.sh | Scorer (inspect_harbor/harbor_scorer) | Test script executed by the scorer to produce reward/metrics |
| solution/solve.sh | Solver (inspect_harbor/oracle) | Reference solution script executed by the Oracle solver for sanity checking |
| task.toml [metadata] | Sample.metadata | Task metadata: author, difficulty, category, tags |
| task.toml [verifier] | Scorer timeout/env vars | Timeout and environment configuration for scorer execution |
| task.toml [agent] | Agent solver env vars | Environment variables for agent execution. The agent timeout_sec is ignored. |
| task.toml [solution] | Oracle solver env vars | Environment variables to set when running the solution script |
| task.toml [environment] | SandboxEnvironmentSpec.config | Resource specifications (CPU, memory, storage, GPU, internet). Overwrites resource limits in environment/docker-compose.yaml |
LLM Judges in Verification
Some Harbor tasks use LLM judges for verification (e.g., evaluating open-ended responses or code quality). These tasks specify the model in their task.toml:
[verifier.env]
MODEL_NAME = "claude-haiku-4-5"
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
The verifier script (tests/test.sh) uses these environment variables to call the LLM. Make sure to set the appropriate API key (e.g., ANTHROPIC_API_KEY) when running tasks with LLM judges.
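The ${ANTHROPIC_API_KEY} reference above is shell-style variable substitution. A sketch of how such [verifier.env] values could be expanded against the host environment (illustrative only, not the verifier's actual mechanism):

```python
from string import Template

def expand_env(raw, host_env):
    """Expand ${VAR} references in env values against the host
    environment; literals pass through unchanged (illustrative sketch)."""
    return {k: Template(v).safe_substitute(host_env) for k, v in raw.items()}

raw = {
    "MODEL_NAME": "claude-haiku-4-5",             # literal, passed through
    "ANTHROPIC_API_KEY": "${ANTHROPIC_API_KEY}",  # pulled from host env
}
env = expand_env(raw, {"ANTHROPIC_API_KEY": "sk-test-123"})
print(env["ANTHROPIC_API_KEY"])
# → sk-test-123
```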
Advanced
Oracle Solver
The Oracle solver is useful for verifying that a dataset is correctly configured and solvable. It executes the task's reference solution (solution/solve.sh script) instead of using a model.
CLI:
inspect eval inspect_harbor/hello_world \
--solver inspect_harbor/oracle
Python API:
from inspect_ai import eval
from inspect_harbor import hello_world, oracle
eval(hello_world(), solver=oracle())
Generic Harbor Interface
For advanced use cases, you can use the generic harbor() interface directly. This provides access to all task loading options including custom registries, git repositories, and local paths.
Harbor Interface Parameters
The harbor() function accepts all parameters from the Task Parameters table plus additional parameters for advanced task loading:
| Parameter | Description | Default | Python Example | CLI Example |
|---|---|---|---|---|
| path | Local path to task/dataset directory, or task identifier for git tasks | None | "/path/to/local_dataset" | "/path/to/local_dataset" |
| task_git_url | Git repository URL for downloading tasks | None | "https://github.com/laude-institute/harbor-datasets.git" | "https://github.com/..." |
| task_git_commit_id | Git commit ID to pin task version | None | "414014c23ce4d32128073d12b057252c918cccf4" | "414014c..." |
| registry_url | Custom registry URL | None (uses Harbor registry) | "https://github.com/myorg/registry.json" | "https://..." |
| registry_path | Path to local registry | None | "/path/to/local/registry.json" | "/path/to/local/registry.json" |
| dataset_name_version | Dataset name and optional version (format: name@version). Omitted versions resolve to: "head" > highest semver > lexically last. | None | "aime" or "aime@1.0" | "aime@1.0" |
| disable_verification | Skip task verification checks | False | True | true |
Note: These are task-specific parameters passed with -T. For additional inspect eval command-line flags (like --model, --message-limit, --epochs, --fail-on-error, --log-dir, --log-level, --max-tasks, etc.), see the Inspect eval CLI reference or Python API reference.
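The resolution order for omitted versions ("head", then highest semver, then lexically last) can be sketched as follows; this assumes versions are plain strings as shown in the registry, and is not the package's actual resolver:

```python
def resolve_version(versions):
    """Pick a version per the stated order:
    "head" > highest semver > lexically last."""
    if "head" in versions:
        return "head"

    def semver(v):
        # Parse "1.10" -> (1, 10); return None for non-numeric versions
        try:
            return tuple(int(p) for p in v.split("."))
        except ValueError:
            return None

    numeric = [(semver(v), v) for v in versions if semver(v) is not None]
    if numeric:
        return max(numeric)[1]  # highest semver wins
    return max(versions)        # otherwise lexically last

print(resolve_version(["1.0", "2.0", "1.10"]))
# → 2.0
```

Note that semver comparison differs from lexical comparison: "1.10" sorts above "1.2" numerically but below it lexically.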
Parameter Combinations
There are four primary patterns for loading Harbor tasks:
| Pattern | Required Parameters | Optional Parameters |
|---|---|---|
| Registry Dataset | dataset_name_version | registry_url or registry_path, dataset_task_names, dataset_exclude_task_names, n_tasks, overwrite_cache |
| Git Task | path, task_git_url | task_git_commit_id, overwrite_cache |
| Local Task | path | disable_verification |
| Local Dataset | path | dataset_task_names, dataset_exclude_task_names, n_tasks, disable_verification |
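The four patterns amount to a small dispatch on which parameters are set. A sketch of that logic (the loading_pattern function and is_dataset flag are illustrative, not the package's actual validation):

```python
def loading_pattern(path=None, task_git_url=None, dataset_name_version=None,
                    is_dataset=False):
    """Classify a harbor() call into one of the four loading patterns
    (illustrative sketch of the table's logic)."""
    if dataset_name_version:
        return "Registry Dataset"
    if path and task_git_url:
        return "Git Task"
    if path:
        return "Local Dataset" if is_dataset else "Local Task"
    raise ValueError("need dataset_name_version or path")

print(loading_pattern(dataset_name_version="aime@1.0"))
# → Registry Dataset
```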
Custom Registries
You can use custom registries for private or organization-specific datasets:
Remote registry:
inspect eval inspect_harbor/harbor \
-T dataset_name_version="my_dataset@1.0" \
-T registry_url="https://github.com/myorg/registry.json" \
--model openai/gpt-5-mini
Local registry:
inspect eval inspect_harbor/harbor \
-T dataset_name_version="my_dataset@1.0" \
-T registry_path="/path/to/local/registry.json" \
--model openai/gpt-5-mini
Loading from Git Repositories
You can load tasks directly from git repositories:
inspect eval inspect_harbor/harbor \
-T path="datasets/aime/aime_6" \
-T task_git_url="https://github.com/laude-institute/harbor-datasets.git" \
-T task_git_commit_id="414014c23ce4d32128073d12b057252c918cccf4" \
--model openai/gpt-5-mini
Loading from Local Paths
You can run tasks from your local filesystem:
inspect eval inspect_harbor/harbor \
-T path="/path/to/task_or_dataset/directory" \
--model openai/gpt-5
Cache Management
Downloaded tasks are cached locally in ~/.harbor/cache/. To force a fresh download:
inspect eval inspect_harbor/aime_1_0 \
-T overwrite_cache=true \
--model openai/gpt-5
To manually clear the entire cache:
rm -rf ~/.harbor/cache/
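The same cleanup can be scripted in Python; the cache location comes from the text above, and clear_cache is just a convenience wrapper:

```python
import shutil
import tempfile
from pathlib import Path

def clear_cache(cache_dir=Path.home() / ".harbor" / "cache"):
    """Remove the Harbor task cache; the next run re-downloads tasks."""
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    return not cache_dir.exists()

# Demo against a throwaway directory rather than the real cache
tmp = Path(tempfile.mkdtemp()) / "cache"
tmp.mkdir()
(tmp / "task.bin").write_text("cached")
cleared = clear_cache(tmp)
```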
Development
Clone the repository and install development dependencies:
git clone https://github.com/meridianlabs-ai/inspect_harbor.git
cd inspect_harbor
make install # Installs dependencies and sets up pre-commit hooks
Run tests and checks:
make check # Run linting (ruff check + format) and type checking (pyright)
make test # Run tests
make cov # Run tests with coverage report
Clean up build artifacts:
make clean # Remove cache and build artifacts
Credits
This work is based on contributions by @iphan and @anthonyduong9 from the inspect_evals repository.