Inspect AI interface to Harbor tasks

Inspect Harbor

This package provides an interface to run Harbor tasks using Inspect AI.

Installation

Install from PyPI:

pip install inspect-harbor

Or with uv:

uv add inspect-harbor

For development installation, see the Development section.

Prerequisites

Before running Harbor tasks, ensure you have:

  • Python 3.12 or higher - Required by inspect_harbor
  • Docker installed and running - Required for execution when using the Docker sandbox (the default)
  • Model API keys - Set appropriate environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY)
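The prerequisites above can be checked before launching an evaluation. The following is a minimal sketch, not part of inspect_harbor: the function name `preflight_problems` and its parameters are hypothetical, and it only confirms that a `docker` executable is on PATH, not that the daemon is running.

```python
import os
import shutil
import sys


def preflight_problems(env=None, required_keys=("OPENAI_API_KEY",)):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # inspect_harbor requires Python 3.12 or higher.
    if sys.version_info < (3, 12):
        problems.append("Python 3.12+ is required")
    # The default sandbox is Docker, so the CLI should be on PATH.
    # This checks only that the binary exists, not that the daemon is up.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    # Model providers read their API keys from environment variables.
    for key in required_keys:
        if not env.get(key):
            problems.append(f"environment variable {key} is not set")
    return problems
```

Calling `preflight_problems(required_keys=("ANTHROPIC_API_KEY",))` before `eval(...)` gives an early, readable failure instead of an error partway through a run.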

Quick Start

The fastest way to get started is to run a dataset from the Harbor registry.

Evaluate with a Model

CLI:

# Run hello-world dataset
inspect eval inspect_harbor/hello_world \
  --model openai/gpt-5-mini

# Run terminal-bench-sample dataset
inspect eval inspect_harbor/terminal_bench_sample \
  --model openai/gpt-5

Python API:

from inspect_ai import eval
from inspect_harbor import hello_world, terminal_bench_sample

# Run hello-world
eval(hello_world(), model="openai/gpt-5-mini")

# Run terminal-bench-sample
eval(terminal_bench_sample(), model="openai/gpt-5")

Available Datasets

Inspect Harbor provides task functions for each dataset in the Harbor registry. You can import and use them directly:

from inspect_harbor import (
    terminal_bench,
    swebenchpro,
    swe_lancer_diamond,
    swebench_verified,
    # ... and many more
)

For a complete list of available datasets and versions (including swebenchpro, terminal-bench-pro, replicationbench, compilebench, and 40+ more), see REGISTRY.md.

Dataset Versioning

Each dataset has both unversioned and versioned task functions:

  • Unversioned functions (e.g., terminal_bench()) automatically use the latest version available in the registry
  • Versioned functions (e.g., terminal_bench_2_0()) pin to a specific version for reproducibility

Example:

from inspect_ai import eval
from inspect_harbor import terminal_bench, terminal_bench_2_0

# Uses latest version (currently 2.0)
eval(terminal_bench(), model="openai/gpt-5-mini")

# Pins to version 2.0 explicitly
eval(terminal_bench_2_0(), model="openai/gpt-5-mini")

Agents and Solvers

Solvers are the execution components in Inspect AI. They can run agent scaffolds (like ReAct), execute solution scripts (like the Oracle solver), perform prompt engineering, and more. Both solvers and agents can be used to solve Harbor tasks.

Default Agent Scaffold

When no agent or solver is specified, Inspect Harbor provides a default agent scaffold for your model.

This default configuration is suitable for most Harbor tasks that require command execution and file manipulation.

Using Custom Agents

You can provide your own agent or solver implementation using the --solver flag:

Using a custom agent:

inspect eval inspect_harbor/terminal_bench \
  --solver path/to/custom/agent.py@custom_agent \
  --model openai/gpt-5

Using Inspect SWE agent framework:

First install the required package:

pip install inspect-swe

CLI:

inspect eval inspect_harbor/terminal_bench_sample \
  --solver inspect_swe/claude_code \
  --model anthropic/claude-sonnet-4-5

Python API:

from inspect_ai import eval
from inspect_harbor import terminal_bench_sample
from inspect_swe import claude_code

eval(
    terminal_bench_sample(),
    solver=claude_code(),
    model="anthropic/claude-sonnet-4-5"
)

Task Parameters

Task functions (like terminal_bench(), swe_lancer_diamond(), etc.) accept the following parameters:

| Parameter | Description | Default | Python Example | CLI Example |
|---|---|---|---|---|
| dataset_task_names | List of task names to include (supports glob patterns) | None | ["aime_60", "aime_61"] | '["aime_60"]' |
| dataset_exclude_task_names | List of task names to exclude (supports glob patterns) | None | ["aime_60"] | '["aime_60"]' |
| n_tasks | Maximum number of tasks to run | None | 10 | 10 |
| overwrite_cache | Force re-download and overwrite cached tasks | False | True | true |
| sandbox_env_name | Sandbox environment name | "docker" | "modal" | "modal" |
| override_cpus | Override the number of CPUs from task.toml | None | 4 | 4 |
| override_memory_mb | Override the memory (in MB) from task.toml | None | 16384 | 16384 |
| override_gpus | Override the number of GPUs from task.toml | None | 1 | 1 |
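To make the selection parameters concrete, here is a small sketch of how include patterns, exclude patterns, and n_tasks could combine. It is an illustration only: `select_tasks` is a hypothetical helper, the use of `fnmatch`-style globs is an assumption, and so is the order of operations (include, then exclude, then cap), which may differ from what inspect_harbor actually does.

```python
from fnmatch import fnmatchcase


def select_tasks(names, include=None, exclude=None, n_tasks=None):
    """Filter task names by include/exclude glob patterns, then cap the count."""
    selected = list(names)
    if include is not None:
        # Keep a name if it matches at least one include pattern.
        selected = [n for n in selected if any(fnmatchcase(n, p) for p in include)]
    if exclude is not None:
        # Drop a name if it matches any exclude pattern.
        selected = [n for n in selected if not any(fnmatchcase(n, p) for p in exclude)]
    if n_tasks is not None:
        # Cap the result, preserving the original order.
        selected = selected[:n_tasks]
    return selected


tasks = ["aime_60", "aime_61", "aime_62", "gpqa_1"]
print(select_tasks(tasks, include=["aime_*"], exclude=["aime_61"], n_tasks=2))
# prints ['aime_60', 'aime_62']
```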

Example

Here's an example showing how to use multiple parameters together:

CLI:

inspect eval inspect_harbor/terminal_bench_sample \
  -T n_tasks=5 \
  -T overwrite_cache=true \
  -T override_memory_mb=8192 \
  --model anthropic/claude-sonnet-4-5

Python API:

from inspect_ai import eval
from inspect_harbor import terminal_bench_sample

eval(
    terminal_bench_sample(
        n_tasks=5,
        overwrite_cache=True,
        override_memory_mb=8192,
    ),
    model="anthropic/claude-sonnet-4-5"
)

This example:

  • Limits to 5 tasks using n_tasks
  • Forces fresh download with overwrite_cache
  • Allocates 8 GB (8192 MB) of memory with override_memory_mb
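When launching runs from a script, the CLI form can be generated from the same values used in the Python form. The sketch below is a hypothetical helper (`task_args` is not part of Inspect AI or inspect_harbor). It assumes that JSON notation for non-string values is accepted by `-T`, which matches the CLI examples in the table above (`true`, `'["aime_60"]'`).

```python
import json
import shlex


def task_args(params):
    """Turn a dict of task parameters into a list of `-T name=value` arguments."""
    args = []
    for name, value in params.items():
        # Strings pass through unchanged; other values use JSON notation,
        # which yields true/false for booleans and ["a", "b"] for lists.
        text = value if isinstance(value, str) else json.dumps(value)
        args += ["-T", f"{name}={text}"]
    return args


command = [
    "inspect", "eval", "inspect_harbor/terminal_bench_sample",
    *task_args({"n_tasks": 5, "overwrite_cache": True, "override_memory_mb": 8192}),
    "--model", "anthropic/claude-sonnet-4-5",
]
# shlex.join produces a shell-safe string for logging or copy-paste.
print(shlex.join(command))
```

Passing `command` as a list to `subprocess.run` avoids shell quoting issues with list-valued parameters.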

Understanding Harbor Tasks

What is a Harbor Task?

Harbor is a framework for building, evaluating, and optimizing agents and models in containerized environments. A Harbor task is a self-contained evaluation unit that includes an instruction, execution environment, scoring criteria, and optionally a reference solution.

For comprehensive details about Harbor tasks, see the Harbor documentation.

Harbor Task File Structure

A typical Harbor task directory contains the following components:

my_task/
├── instruction.md      # Task instructions/prompt shown to the agent
├── task.toml           # Metadata, timeouts, resource specs (CPU/memory/GPU), env vars
├── environment/        # Environment setup - Dockerfile or docker-compose.yaml
│   └── Dockerfile      # Docker environment spec (varies by sandbox provider)
├── solution/           # (Optional) Reference solution for sanity checking
│   ├── solve.sh        # Executable solution script used by Oracle solver
│   └── ...             # Supporting solution files and dependencies
└── tests/              # Verification and scoring
    ├── test.sh         # Test script executed by verifier
    └── ...             # Outputs reward.txt or reward.json to /logs/verifier/
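A quick layout check can catch an incomplete task directory before a run. This is a minimal sketch with a hypothetical helper name. Treating every entry except solution/ as required is an assumption based on the tree above, not a rule taken from Harbor.

```python
import tempfile
from pathlib import Path


def missing_task_files(task_dir):
    """Return the expected entries that are absent from a Harbor task directory."""
    root = Path(task_dir)
    # solution/ is optional, so it is not checked here.
    expected = ["instruction.md", "task.toml", "environment", "tests/test.sh"]
    return [entry for entry in expected if not (root / entry).exists()]


# Demo on a throwaway directory that contains only two of the expected entries.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "instruction.md").write_text("Say hello.\n")
    (Path(tmp) / "task.toml").write_text("")
    print(missing_task_files(tmp))
```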

Harbor to Inspect Mapping

Inspect Harbor bridges Harbor tasks to the Inspect AI evaluation framework using the following mappings:

| Harbor Concept | Inspect Concept | Description |
|---|---|---|
| Harbor Task | Sample | A single evaluation instance with instructions and environment |
| Harbor Dataset | Task | A collection of related evaluation instances |
| instruction.md | Sample.input | The prompt/instructions given to the agent |
| environment/ | SandboxEnvironmentSpec | Docker/environment configuration for isolated execution |
| tests/test.sh | Scorer (inspect_harbor/harbor_scorer) | Test script executed by the scorer to produce reward/metrics |
| solution/solve.sh | Solver (inspect_harbor/oracle) | Reference solution script executed by the Oracle solver for sanity checking |
| task.toml[metadata] | Sample.metadata | Task metadata: author, difficulty, category, tags |
| task.toml[verifier] | Scorer timeout/env vars | Timeout and environment configuration for scorer execution |
| task.toml[agent] | Agent solver env vars | Environment variables for agent execution. Agent timeout_sec is ignored. |
| task.toml[solution] | Oracle solver env vars | Environment variables to set when running the solution script |
| task.toml[environment] | SandboxEnvironmentSpec.config | Resource specifications (CPU, memory, storage, GPU, internet). Overwrites resource limits in environment/docker-compose.yaml |

LLM Judges in Verification

Some Harbor tasks use LLM judges for verification (e.g., evaluating open-ended responses or code quality). These tasks specify the model in their task.toml:

[verifier.env]
MODEL_NAME = "claude-haiku-4-5"
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

The verifier script (tests/test.sh) uses these environment variables to call the LLM. Make sure to set the appropriate API key (e.g., ANTHROPIC_API_KEY) when running tasks with LLM judges.
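The `${ANTHROPIC_API_KEY}` value above is a placeholder that has to be filled from the host environment. As a rough sketch of that substitution (the helper `resolve_env` is hypothetical, and raising on a missing variable is a design choice of this example, not documented behavior of Harbor or inspect_harbor):

```python
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(template, host_env):
    """Replace ${NAME} placeholders in template values using host_env."""
    missing = []

    def substitute(match):
        name = match.group(1)
        if name not in host_env:
            missing.append(name)
            return ""
        return host_env[name]

    resolved = {key: _PLACEHOLDER.sub(substitute, value) for key, value in template.items()}
    if missing:
        # Fail early so a judge-based task does not run without its API key.
        raise KeyError(f"missing environment variables: {sorted(set(missing))}")
    return resolved


verifier_env = {
    "MODEL_NAME": "claude-haiku-4-5",
    "ANTHROPIC_API_KEY": "${ANTHROPIC_API_KEY}",
}
print(resolve_env(verifier_env, {"ANTHROPIC_API_KEY": "sk-example"}))
```

In real use, `host_env` would be `os.environ`; the literal key here is a dummy value for the demo.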

Advanced

Oracle Solver

The Oracle solver is useful for verifying that a dataset is correctly configured and solvable. It executes the task's reference solution (solution/solve.sh script) instead of using a model.

CLI:

inspect eval inspect_harbor/hello_world \
  --solver inspect_harbor/oracle

Python API:

from inspect_ai import eval
from inspect_harbor import hello_world, oracle

eval(hello_world(), solver=oracle())

Generic Harbor Interface

For advanced use cases, you can use the generic harbor() interface directly. This provides access to all task loading options including custom registries, git repositories, and local paths.

Harbor Interface Parameters

The harbor() function accepts all parameters from the Task Parameters table plus additional parameters for advanced task loading:

| Parameter | Description | Default | Python Example | CLI Example |
|---|---|---|---|---|
| path | Local path to task/dataset directory, or task identifier for git tasks | None | "/path/to/local_dataset" | "/path/to/local_dataset" |
| task_git_url | Git repository URL for downloading tasks | None | "https://github.com/laude-institute/harbor-datasets.git" | "https://github.com/..." |
| task_git_commit_id | Git commit ID to pin task version | None | "414014c23ce4d32128073d12b057252c918cccf4" | "414014c..." |
| registry_url | Custom registry URL | None (uses Harbor registry) | "https://github.com/myorg/registry.json" | "https://..." |
| registry_path | Path to local registry | None | "/path/to/local/registry.json" | "/path/to/local/registry.json" |
| dataset_name_version | Dataset name and optional version (format: name@version). Omitted versions resolve to: "head" > highest semver > lexically last. | None | "aime" or "aime@1.0" | "aime@1.0" |
| disable_verification | Skip task verification checks | False | True | true |

Note: These are task-specific parameters passed with -T. For additional inspect eval command-line flags (like --model, --message-limit, --epochs, --fail-on-error, --log-dir, --log-level, --max-tasks, etc.), see the Inspect eval CLI reference or Python API reference.
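The resolution rule for dataset_name_version ("head" > highest semver > lexically last) can be sketched as follows. This illustrates the stated rule and is not the library's code: both helper names are hypothetical, and accepting two-part versions such as "1.0" as semver-like is an assumption made so that the example fits the versions shown in this document.

```python
import re

# Loose semver: one to three dot-separated integers, e.g. "2", "1.0", "1.2.3".
_SEMVER = re.compile(r"^(\d+)(?:\.(\d+))?(?:\.(\d+))?$")


def split_name_version(spec):
    """Split 'name@version' into (name, version); version is None if omitted."""
    name, _, version = spec.partition("@")
    return name, version or None


def _semver_key(version):
    match = _SEMVER.match(version)
    if match is None:
        return None
    # Missing components count as zero, so "1.0" compares as (1, 0, 0).
    return tuple(int(part or 0) for part in match.groups())


def resolve_version(available):
    """Pick a version: 'head' first, then highest semver, then lexically last."""
    if not available:
        raise ValueError("no versions available")
    if "head" in available:
        return "head"
    numeric = [v for v in available if _semver_key(v) is not None]
    if numeric:
        return max(numeric, key=_semver_key)
    return max(available)


print(split_name_version("aime@1.0"))
print(resolve_version(["1.9", "1.10", "parity"]))
```

Comparing versions as integer tuples rather than strings is what makes "1.10" rank above "1.9".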

Parameter Combinations

There are four primary patterns for loading Harbor tasks:

| Pattern | Required Parameters | Optional Parameters |
|---|---|---|
| Registry Dataset | dataset_name_version | registry_url or registry_path, dataset_task_names, dataset_exclude_task_names, n_tasks, overwrite_cache |
| Git Task | path, task_git_url | task_git_commit_id, overwrite_cache |
| Local Task | path | disable_verification |
| Local Dataset | path | dataset_task_names, dataset_exclude_task_names, n_tasks, disable_verification |
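One way to read the patterns above is as a decision on which parameters are present. The sketch below is a hypothetical classifier, not the dispatch logic of harbor(). It assumes a git task needs both path and task_git_url, as in the git example in this document. The precedence it applies when several parameters are supplied together is also an assumption, and it does not try to tell a local task from a local dataset.

```python
def loading_pattern(path=None, task_git_url=None, dataset_name_version=None):
    """Name the loading pattern implied by the supplied parameters."""
    if dataset_name_version is not None:
        return "registry dataset"
    if task_git_url is not None:
        if path is None:
            raise ValueError("git tasks need both path and task_git_url")
        return "git task"
    if path is not None:
        # A local path may point at a single task or at a dataset directory.
        return "local task or dataset"
    raise ValueError("provide dataset_name_version, or path (optionally with task_git_url)")


print(loading_pattern(dataset_name_version="aime@1.0"))
print(loading_pattern(path="/path/to/task_or_dataset/directory"))
```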

Custom Registries

You can use custom registries for private or organization-specific datasets:

Remote registry:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="my_dataset@1.0" \
  -T registry_url="https://github.com/myorg/registry.json" \
  --model openai/gpt-5-mini

Local registry:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="my_dataset@1.0" \
  -T registry_path="/path/to/local/registry.json" \
  --model openai/gpt-5-mini

Loading from Git Repositories

You can load tasks directly from git repositories:

inspect eval inspect_harbor/harbor \
  -T path="datasets/aime/aime_6" \
  -T task_git_url="https://github.com/laude-institute/harbor-datasets.git" \
  -T task_git_commit_id="414014c23ce4d32128073d12b057252c918cccf4" \
  --model openai/gpt-5-mini

Loading from Local Paths

You can run tasks from your local filesystem:

inspect eval inspect_harbor/harbor \
  -T path="/path/to/task_or_dataset/directory" \
  --model openai/gpt-5

Cache Management

Downloaded tasks are cached locally in ~/.harbor/cache/. To force a fresh download:

inspect eval inspect_harbor/aime_1_0 \
  -T overwrite_cache=true \
  --model openai/gpt-5

To manually clear the entire cache:

rm -rf ~/.harbor/cache/
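The same cleanup can be done from Python, for example on platforms without `rm`. This is a minimal sketch with a hypothetical helper name. The default location is the ~/.harbor/cache/ path stated above, and the demo runs against a temporary directory so that it does not touch a real cache.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir=None):
    """Delete the cache directory; return True if something was removed."""
    if cache_dir is None:
        # Default cache location, as stated in this section.
        cache_dir = Path.home() / ".harbor" / "cache"
    cache_dir = Path(cache_dir)
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demo against a throwaway directory rather than the real cache.
with tempfile.TemporaryDirectory() as tmp:
    fake_cache = Path(tmp) / "cache"
    (fake_cache / "aime").mkdir(parents=True)
    print(clear_cache(fake_cache))  # True: the directory existed and was removed
    print(clear_cache(fake_cache))  # False: nothing left to remove
```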

Development

Clone the repository and install development dependencies:

git clone https://github.com/meridianlabs-ai/inspect_harbor.git
cd inspect_harbor
make install  # Installs dependencies and sets up pre-commit hooks

Run tests and checks:

make check    # Run linting (ruff check + format) and type checking (pyright)
make test     # Run tests
make cov      # Run tests with coverage report

Clean up build artifacts:

make clean    # Remove cache and build artifacts

Credits

This work is based on contributions by @iphan and @anthonyduong9 from the inspect_evals repository.
