Inspect Harbor

This package provides an interface to run Harbor tasks using Inspect AI.

Installation

Using uv:

git clone https://github.com/meridianlabs-ai/inspect_harbor.git
cd inspect_harbor
uv sync

Using pip:

git clone https://github.com/meridianlabs-ai/inspect_harbor.git
cd inspect_harbor
pip install -e .

Prerequisites

Before running Harbor tasks, ensure you have:

  • Python 3.12 or higher - Required by inspect_harbor
  • Docker installed and running - Required for execution when using Docker sandbox (default)
  • Model API keys - Set appropriate environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY)

Understanding Harbor Tasks

What is a Harbor Task?

Harbor is a framework for building, evaluating, and optimizing agents and models in containerized environments. A Harbor task is a self-contained evaluation unit that includes an instruction, execution environment, scoring criteria, and optionally a reference solution.

For comprehensive details about Harbor tasks, see the Harbor documentation.

Harbor Task File Structure

A typical Harbor task directory contains the following components:

my_task/
├── instruction.md      # Task instructions/prompt shown to the agent
├── task.toml           # Metadata, timeouts, resource specs (CPU/memory/GPU), env vars
├── environment/        # Environment setup - Dockerfile or docker-compose.yaml
│   └── Dockerfile      # Docker environment spec (varies by sandbox provider)
├── solution/           # (Optional) Reference solution for sanity checking
│   ├── solve.sh        # Executable solution script used by Oracle solver
│   └── ...             # Supporting solution files and dependencies
└── tests/              # Verification and scoring
    ├── test.sh         # Test script executed by verifier
    └── ...             # Outputs reward.txt or reward.json to /logs/verifier/

Harbor to Inspect Mapping

Inspect Harbor bridges Harbor tasks to the Inspect AI evaluation framework using the following mappings:

| Harbor Concept | Inspect Concept | Description |
|---|---|---|
| Harbor Task | Sample | A single evaluation instance with instructions and environment |
| Harbor Dataset | Task | A collection of related evaluation instances |
| instruction.md | Sample.input | The prompt/instructions given to the agent |
| environment/ | SandboxEnvironmentSpec | Docker/environment configuration for isolated execution |
| tests/test.sh | Scorer (inspect_harbor/harbor_scorer) | Test script executed by the scorer to produce reward/metrics |
| solution/solve.sh | Solver (inspect_harbor/oracle) | Reference solution script executed by the Oracle solver for sanity checking |
| task.toml[metadata] | Sample.metadata | Task metadata: author, difficulty, category, tags |
| task.toml[verifier] | Scorer timeout/env vars | Timeout and environment configuration for scorer execution |
| task.toml[agent] | Task.time_limit | Agent timeout per Harbor task. Mapped to Task.time_limit using the maximum value across all samples |
| task.toml[solution] | Oracle solver env vars | Environment variables to set when running the solution script |
| task.toml[environment] | SandboxEnvironmentSpec.config | Resource specifications (CPU, memory, storage, GPU, internet). Overwrites resource limits in environment/docker-compose.yaml |

LLM Judges in Verification

Some Harbor tasks use LLM judges for verification (e.g., evaluating open-ended responses or code quality). These tasks specify the model in their task.toml:

[verifier.env]
MODEL_NAME = "claude-haiku-4-5"
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

The verifier script (tests/test.sh) uses these environment variables to call the LLM. Make sure to set the appropriate API key (e.g., ANTHROPIC_API_KEY) when running tasks with LLM judges.

Note: Most Harbor tasks use deterministic test scripts and don't require LLM judges.

Quick Start

The fastest way to get started is to run a task from the Harbor registry.

Evaluate with a Model

Run a Harbor task with any Inspect-compatible model:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  -T dataset_task_names='["aime_60"]' \
  --model openai/gpt-4o-mini

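This command resolves the aime@1.0 dataset through the Harbor registry, downloads only the aime_60 task, and evaluates it with openai/gpt-4o-mini using the default agent scaffold.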

Note: To execute the whole dataset, omit the dataset_task_names task parameter.

Verify with Oracle Solver

Before evaluating with models, you can verify that a task is solvable using its reference solution:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  -T dataset_task_names='["aime_60"]' \
  --solver inspect_harbor/oracle

The Oracle solver executes the task's solution/solve.sh script to confirm the task is correctly configured and solvable.

Using the Python API

You can also run Harbor tasks programmatically using the Python API:

from inspect_ai import eval
from inspect_harbor import harbor

eval(
    harbor(
        dataset_name_version="aime@1.0",
        dataset_task_names=["aime_60"]
    ),
    model="openai/gpt-4o-mini"
)

With Oracle solver:

from inspect_ai import eval
from inspect_harbor import harbor, oracle

eval(
    harbor(
        dataset_name_version="aime@1.0",
        dataset_task_names=["aime_60"],
        solver=oracle()
    )
)

With custom parameters:

from inspect_ai import eval
from inspect_harbor import harbor

eval(
    harbor(path="/path/to/local/dataset"),
    model="openai/gpt-4o-mini",
    continue_on_fail=True,
    message_limit=100,
)

Harbor Registry

The Harbor registry is a centralized catalog of curated Harbor datasets and tasks. Inspect Harbor uses this registry to automatically download and resolve datasets, following the same behavior as Harbor.

Default Registry

By default, Inspect Harbor uses the official Harbor registry. When you specify a dataset_name_version, it automatically:

  1. Looks up the dataset in the registry
  2. Finds the corresponding GitHub repository
  3. Downloads only the requested tasks (or all tasks if not filtered)
  4. Caches them locally for future use

Example:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  -T dataset_task_names='["aime_60"]' \
  --model openai/gpt-4o-mini

→ Resolves to harbor-datasets/aime version 1.0 and downloads only the aime_60 task
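
The dataset_name_version value follows the name@version format. A minimal sketch of splitting such a reference is shown below. The helper name is hypothetical, and requiring both parts is an assumption of this sketch; the real loader may be more lenient.

```python
def split_dataset_ref(ref: str) -> tuple[str, str]:
    """Split "name@version" into (name, version).

    Assumption: both parts are required and the last "@" is the separator.
    """
    name, sep, version = ref.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected name@version, got {ref!r}")
    return name, version


print(split_dataset_ref("aime@1.0"))  # ('aime', '1.0')
```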

Custom Registries

You can use custom registries for private or organization-specific datasets:

Remote registry:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="my_dataset@1.0" \
  -T registry_url="https://github.com/myorg/registry.json" \
  --model openai/gpt-4o-mini

Local registry:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="my_dataset@1.0" \
  -T registry_path="/path/to/local/registry.json" \
  --model openai/gpt-4o-mini

Cache Management

Downloaded tasks are cached locally. To force a fresh download:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  -T overwrite_cache=true \
  --model openai/gpt-4o-mini
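
The effect of overwrite_cache can be pictured as a simple decision: reuse a populated cache directory unless a fresh download is forced. The sketch below shows that decision only. The function name and the one-directory-per-task cache layout are assumptions for illustration, not the package's actual implementation.

```python
from pathlib import Path


def needs_download(cache_dir: Path, overwrite_cache: bool = False) -> bool:
    """Decide whether a task has to be fetched again.

    Assumption: a task counts as cached when its directory exists and is
    not empty.
    """
    if overwrite_cache:
        return True
    return not (cache_dir.is_dir() and any(cache_dir.iterdir()))
```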

Usage

Agents and Solvers

Solvers are the execution components in Inspect AI. They can run agent scaffolds (like ReAct), execute solution scripts (like the Oracle solver), perform prompt engineering, and more. Both solvers and agents can be used to solve Harbor tasks.

Default Agent Scaffold

When no agent or solver is specified, Inspect Harbor provides a default agent scaffold for your model: a ReAct agent with the bash, python, memory, and update_plan tools.

This default configuration is suitable for most Harbor tasks that require command execution and file manipulation.

Using Custom Agents

You can provide your own agent or solver implementation using the --solver flag:

Using a custom agent:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  --solver path/to/custom/agent.py@custom_agent \
  --model openai/gpt-4o-mini

Using the Inspect SWE agent framework:

First install the required package:

pip install inspect-swe

Then use it via CLI:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  --solver inspect_swe/claude_code \
  --model anthropic/claude-sonnet-4-5

Or via Python API:

from inspect_ai import eval
from inspect_harbor import harbor
from inspect_swe import claude_code

eval(
    harbor(dataset_name_version="aime@1.0"),
    solver=claude_code(),
    model="anthropic/claude-sonnet-4-5"
)

Note: Make sure you have your ANTHROPIC_API_KEY in a .env file or set as an environment variable.

Task and Dataset Sources

In addition to the Harbor Registry (covered above), you can also load Harbor tasks from local filesystems or git repositories.

From Local Path

# Run a single local task or dataset
inspect eval inspect_harbor/harbor \
  -T path="/path/to/task_or_dataset/directory" \
  --model openai/gpt-4o-mini

From Git Repository

# Download and run a task from a git repository
inspect eval inspect_harbor/harbor \
  -T path="aime_60" \
  -T task_git_url="https://github.com/example/tasks.git" \
  -T task_git_commit_id="abc123" \
  --model openai/gpt-4o-mini

Task Parameters

The following parameters configure the Inspect Harbor task interface. They can be used in Python by importing inspect_harbor.harbor or via the command line with inspect eval inspect_harbor/harbor -T <parameter>=<value>.

| Parameter | Description | Example |
|---|---|---|
| path | Local path to task/dataset directory, or task identifier for git tasks | "/path/to/task" or "aime_i-9" |
| task_git_url | Git repository URL for downloading tasks | "https://github.com/example/tasks.git" |
| task_git_commit_id | Git commit ID to pin task version | "abc123" |
| registry_url | Custom registry URL (defaults to Harbor registry) | "https://github.com/custom/registry.json" |
| registry_path | Path to local registry | "/path/to/registry.json" |
| dataset_name_version | Dataset name and version (format: name@version) | "aime@1.0" |
| dataset_task_names | List of task names to include (supports glob patterns) | '["aime_60", "aime_61"]' or '["aime*"]' |
| dataset_exclude_task_names | List of task names to exclude (supports glob patterns) | '["task1", "task2"]' |
| n_tasks | Maximum number of tasks to run | 10 |
| disable_verification | Skip task verification checks | true or false |
| overwrite_cache | Force re-download and overwrite cached tasks (default: false). Works for both git tasks and registry datasets. | true or false |
| sandbox_env_name | Sandbox environment name (default: "docker") | "modal" or "docker" |
| solver | Custom solver (defaults to ReAct agent with bash/python/memory/update_plan tools) | inspect_harbor/oracle |

Note: These are task-specific parameters passed with -T. For additional inspect eval command-line flags (like --model, --message-limit, --epochs, --fail-on-error, --log-dir, --log-level, --max-tasks, etc.), see the Inspect eval CLI reference or Python API reference.

Development

Install development dependencies:

make install  # Installs dependencies and sets up pre-commit hooks

Or manually using uv:

uv sync

Run tests and checks:

make check    # Run linting (ruff check + format) and type checking (pyright)
make test     # Run tests
make cov      # Run tests with coverage report

Clean up build artifacts:

make clean    # Remove cache and build artifacts
