Inspect AI interface to Harbor tasks

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

jjallaire

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

Inspect Harbor

This package provides an interface to run Harbor tasks using Inspect AI.

Installation

Install from PyPI:

pip install inspect-harbor

Or with uv:

uv add inspect-harbor

For development installation, see the Development section.

Prerequisites

Before running Harbor tasks, ensure you have:

- Python 3.12 or higher - Required by inspect_harbor
- Docker installed and running - Required for execution when using the Docker sandbox (default)
- Model API keys - Set the appropriate environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY)

Quick Start

The fastest way to get started is to run a dataset from the Harbor registry.

Evaluate with a Model

Run a Harbor dataset with any Inspect-compatible model:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="hello-world" \
  --model openai/gpt-5-mini

Or run a different dataset:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="terminal-bench-sample" \
  --model openai/gpt-5-mini

The second command:

- Loads the terminal-bench-sample@2.0 dataset (latest version) from the Harbor registry
- Downloads and caches all 10 tasks
- Solves the tasks with the default ReAct agent using GPT-5-mini
- Executes in a Docker sandbox environment
- Stores results in ./logs

Using the Python API

You can also run Harbor tasks programmatically using the Python API:

from inspect_ai import eval
from inspect_harbor import harbor

eval(
    harbor(dataset_name_version="hello-world"),
    model="openai/gpt-5-mini"
)

Understanding Harbor Tasks

What is a Harbor Task?

Harbor is a framework for building, evaluating, and optimizing agents and models in containerized environments. A Harbor task is a self-contained evaluation unit that includes an instruction, execution environment, scoring criteria, and optionally a reference solution.

For comprehensive details about Harbor tasks, see the Harbor documentation.

Harbor Task File Structure

A typical Harbor task directory contains the following components:

my_task/
├── instruction.md      # Task instructions/prompt shown to the agent
├── task.toml           # Metadata, timeouts, resource specs (CPU/memory/GPU), env vars
├── environment/        # Environment setup - Dockerfile or docker-compose.yaml
│   └── Dockerfile      # Docker environment spec (varies by sandbox provider)
├── solution/           # (Optional) Reference solution for sanity checking
│   ├── solve.sh        # Executable solution script used by Oracle solver
│   └── ...             # Supporting solution files and dependencies
└── tests/              # Verification and scoring
    ├── test.sh         # Test script executed by verifier
    └── ...             # Outputs reward.txt or reward.json to /logs/verifier/
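
As a quick sanity check before running a local task, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of inspect_harbor: the helper name `missing_task_entries` is hypothetical, and treating `instruction.md`, `task.toml`, `environment/`, and `tests/test.sh` as required is an assumption based on the listing, which marks only `solution/` as optional.

```python
import tempfile
from pathlib import Path

# Entries from the directory listing above; solution/ is optional there.
# Treating these four as required is an assumption of this sketch.
REQUIRED_ENTRIES = ("instruction.md", "task.toml", "environment", "tests/test.sh")


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent from task_dir."""
    return [name for name in REQUIRED_ENTRIES if not (task_dir / name).exists()]


# Demo on a throwaway directory that lacks tests/test.sh.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "my_task"
    (task / "environment").mkdir(parents=True)
    (task / "tests").mkdir()
    (task / "instruction.md").write_text("Print hello world.\n")
    (task / "task.toml").write_text("[metadata]\n")
    print(missing_task_entries(task))
```

An empty list means every expected entry is present.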

Harbor to Inspect Mapping

Inspect Harbor bridges Harbor tasks to the Inspect AI evaluation framework using the following mappings:

Harbor Concept	Inspect Concept	Description
Harbor Task	`Sample`	A single evaluation instance with instructions and environment
Harbor Dataset	`Task`	A collection of related evaluation instances
instruction.md	`Sample.input`	The prompt/instructions given to the agent
environment/	`SandboxEnvironmentSpec`	Docker/environment configuration for isolated execution
tests/test.sh	`Scorer` (`inspect_harbor/harbor_scorer`)	Test script executed by the scorer to produce reward/metrics
solution/solve.sh	`Solver` (`inspect_harbor/oracle`)	Reference solution script executed by the Oracle solver for sanity checking
task.toml[metadata]	`Sample.metadata`	Task metadata: author, difficulty, category, tags
task.toml[verifier]	Scorer timeout/env vars	Timeout and environment configuration for scorer execution
task.toml[agent]	Agent solver env vars	Environment variables for agent execution. Agent timeout_sec is ignored.
task.toml[solution]	Oracle solver env vars	Environment variables to set when running the solution script
task.toml[environment]	`SandboxEnvironmentSpec.config`	Resource specifications (CPU, memory, storage, GPU, internet). Overrides resource limits in `environment/docker-compose.yaml`

LLM Judges in Verification

Some Harbor tasks use LLM judges for verification (e.g., evaluating open-ended responses or code quality). These tasks specify the model in their task.toml:

[verifier.env]
MODEL_NAME = "claude-haiku-4-5"
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

The verifier script (tests/test.sh) uses these environment variables to call the LLM. Make sure to set the appropriate API key (e.g., ANTHROPIC_API_KEY) when running tasks with LLM judges.

Task Parameters

The following parameters configure the Inspect Harbor task interface. Use them in Python by importing harbor from inspect_harbor, or on the command line with inspect eval inspect_harbor/harbor -T <parameter>=<value>.

Parameter	Description	Default	Example
`path`	Local path to task/dataset directory, or task identifier for git tasks	`None`	`"/path/to/local_dataset_or_task"` or `"datasets/aime/aime_60"`
`task_git_url`	Git repository URL for downloading tasks	`None`	`"https://github.com/laude-institute/harbor-datasets.git"`
`task_git_commit_id`	Git commit ID to pin task version	`None`	`"414014c23ce4d32128073d12b057252c918cccf4"`
`registry_url`	Custom registry URL	`None` (uses Harbor registry)	`"https://raw.githubusercontent.com/laude-institute/harbor/refs/heads/main/registry.json"`
`registry_path`	Path to local registry	`None`	`"/path/to/local/registry.json"`
`dataset_name_version`	Dataset name and optional version (format: `name@version`). Omitted versions resolve to: `"head"` > highest semver > lexically last.	`None`	`"aime"` or `"aime@1.0"`
`dataset_task_names`	List of task names to include (supports glob patterns)	`None`	`'["aime_60", "aime_61"]'` or `'["aime*"]'`
`dataset_exclude_task_names`	List of task names to exclude (supports glob patterns)	`None`	`'["aime_60", "aime_61"]'` or `'["aime*"]'`
`n_tasks`	Maximum number of tasks to run. Preferred over `--max-samples`: only downloads n tasks instead of entire dataset.	`None`	`10`
`disable_verification`	Skip task verification checks	`False`	`true` or `false`
`overwrite_cache`	Force re-download and overwrite cached tasks. Works for both git tasks and registry datasets.	`False`	`true` or `false`
`sandbox_env_name`	Sandbox environment name	`"docker"`	`"modal"` or `"docker"`
`override_cpus`	Override the number of CPUs from `task.toml`	`None`	`4`
`override_memory_mb`	Override the memory (in MB) from `task.toml`	`None`	`16384`
`override_gpus`	Override the number of GPUs from `task.toml`	`None`	`1`

Note: These are task-specific parameters passed with -T. For additional inspect eval command-line flags (like --model, --message-limit, --epochs, --fail-on-error, --log-dir, --log-level, --max-tasks, etc.), see the Inspect eval CLI reference or Python API reference.
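
The resolution order for an omitted version (`"head"` > highest semver > lexically last) can be sketched as follows. This is an illustration of the rule as stated in the table, not the code inspect_harbor or Harbor actually runs. The function name `resolve_default_version` is hypothetical, and accepting two-part versions such as `1.0` as semver-like is an assumption made so the table's `aime@1.0` example fits.

```python
import re

# Accepts "1.0" and "1.0.3"; strict semver would require three parts (assumption).
_SEMVER = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?$")


def resolve_default_version(versions: list[str]) -> str:
    """Pick a version when none was requested: head, else highest semver, else lexically last."""
    if not versions:
        raise ValueError("no versions available")
    if "head" in versions:
        return "head"
    parsed = []
    for version in versions:
        match = _SEMVER.match(version)
        if match:
            key = tuple(int(part or 0) for part in match.groups())
            parsed.append((key, version))
    if parsed:
        # Compare numerically so that 10.0 ranks above 2.0.
        return max(parsed)[1]
    return max(versions)


print(resolve_default_version(["1.0", "2.0", "10.0"]))
```

The numeric comparison is the reason for parsing at all: a plain string sort would rank `2.0` above `10.0`.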

Harbor Registry

The Harbor registry is a centralized catalog of curated Harbor datasets and tasks. Inspect Harbor uses this registry to automatically download and resolve datasets, following the same behavior as Harbor.

Default Registry

By default, Inspect Harbor uses the official Harbor registry. When you specify a dataset_name_version, it automatically:

1. Looks up the dataset in the registry
2. Finds the corresponding GitHub repository
3. Downloads only the requested tasks (or all tasks if not filtered)
4. Caches them locally for future use

Example:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  -T dataset_task_names='["aime_60"]' \
  --model openai/gpt-5-mini

→ Resolves to harbor-datasets/aime version 1.0 and downloads only the aime_60 task

Custom Registries

You can use custom registries for private or organization-specific datasets:

Remote registry:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="my_dataset@1.0" \
  -T registry_url="https://github.com/myorg/registry.json" \
  --model openai/gpt-5-mini

Local registry:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="my_dataset@1.0" \
  -T registry_path="/path/to/local/registry.json" \
  --model openai/gpt-5-mini

Cache Management

Downloaded tasks are cached locally in ~/.harbor/cache/. To force a fresh download:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime@1.0" \
  -T overwrite_cache=true \
  --model openai/gpt-5-mini

To manually clear the entire cache:

rm -rf ~/.harbor/cache/
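
On systems without `rm`, the same cleanup can be done from Python. A minimal sketch, assuming the cache location given above; the helper `clear_cache` is hypothetical and simply removes the directory tree it is given.

```python
import shutil
import tempfile
from pathlib import Path

# Cache location as stated above.
DEFAULT_CACHE_DIR = Path.home() / ".harbor" / "cache"


def clear_cache(cache_dir: Path = DEFAULT_CACHE_DIR) -> bool:
    """Delete cache_dir recursively; return True if something was removed."""
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demo against a throwaway directory so the real cache is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_cache = Path(tmp) / "cache"
    (demo_cache / "aime_60").mkdir(parents=True)
    print(clear_cache(demo_cache), demo_cache.exists())
```

Calling `clear_cache()` with no argument targets the real cache directory.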

Usage

Agents and Solvers

Solvers are the execution components in Inspect AI. They can run agent scaffolds (like ReAct), execute solution scripts (like the Oracle solver), perform prompt engineering, and more. Both solvers and agents can be used to solve Harbor tasks.

Default Agent Scaffold

When no agent or solver is specified, Inspect Harbor provides a default agent scaffold for your model:

- Agent Type: ReAct agent
- Tools: bash(timeout=300), python(timeout=300), update_plan()
- Compaction: CompactionEdit() for context window management

This default configuration is suitable for most Harbor tasks that require command execution and file manipulation.

Using Custom Agents

You can provide your own agent or solver implementation using the --solver flag:

Using a custom agent:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime" \
  --solver path/to/custom/agent.py@custom_agent \
  --model openai/gpt-5-mini

Using Inspect SWE agent framework:

First install the required package:

pip install inspect-swe

Note: Make sure you have your ANTHROPIC_API_KEY in a .env file or set as an environment variable.

Then use it via CLI:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="aime" \
  --solver inspect_swe/claude_code \
  --model anthropic/claude-sonnet-4-5

Or via Python API:

from inspect_ai import eval
from inspect_harbor import harbor
from inspect_swe import claude_code

eval(
    harbor(dataset_name_version="aime"),
    solver=claude_code(),
    model="anthropic/claude-sonnet-4-5"
)

Oracle Solver

The Oracle solver is useful for verifying that a dataset is correctly configured and solvable. It executes the task's reference solution (solution/solve.sh script) instead of using a model.

CLI usage:

inspect eval inspect_harbor/harbor \
  -T dataset_name_version="hello-world" \
  --solver inspect_harbor/oracle

Python API usage:

from inspect_ai import eval
from inspect_harbor import harbor, oracle

eval(
    harbor(dataset_name_version="hello-world"),
    solver=oracle()
)

Task and Dataset Sources

In addition to the Harbor Registry (covered above), you can also load Harbor tasks from local filesystems or git repositories.

Parameter Combinations

There are four primary patterns for loading Harbor tasks:

Pattern	Required Parameters	Optional Parameters
Registry Dataset	`dataset_name_version`	`registry_url` or `registry_path`, `dataset_task_names`, `dataset_exclude_task_names`, `n_tasks`, `overwrite_cache`
Git Task	`path`, `task_git_url`	`task_git_commit_id`, `overwrite_cache`
Local Task	`path`	`disable_verification`
Local Dataset	`path`	`dataset_task_names`, `dataset_exclude_task_names`, `n_tasks`, `disable_verification`

From Local Path

# Run a single local task or dataset
inspect eval inspect_harbor/harbor \
  -T path="/path/to/task_or_dataset/directory" \
  --model openai/gpt-5-mini

From Git Repository

# Download and run a task from a git repository
inspect eval inspect_harbor/harbor \
  -T path="datasets/aime/aime_6" \
  -T task_git_url="https://github.com/laude-institute/harbor-datasets.git" \
  -T task_git_commit_id="414014c23ce4d32128073d12b057252c918cccf4" \
  --model openai/gpt-5-mini

Overrides

Inspect Harbor supports overriding resource specifications from a task's task.toml configuration. This is useful when you need more resources than specified in the task configuration.

Default Values

Parameter	Default	Notes
`cpus`	1	Respects task config or override
`memory_mb`	6144 (6GB)	6GB minimum enforced (uses task config or override if ≥ 6GB)
`gpus`	0	Respects task config or override

Examples

For example, terminal-bench-sample tasks may require more memory than the default 6GB minimum when using agents like Claude Code:

CLI usage:

# Override memory for Claude Code agent
inspect eval inspect_harbor/harbor \
  -T dataset_name_version="terminal-bench-sample" \
  -T override_memory_mb=16384 \
  --solver inspect_swe/claude_code \
  --model anthropic/claude-sonnet-4-5

Python API usage:

from inspect_ai import eval
from inspect_harbor import harbor
from inspect_swe import claude_code

eval(
    harbor(
        dataset_name_version="terminal-bench-sample",
        override_memory_mb=16384,  # 16GB in MB
    ),
    solver=claude_code(),
    model="anthropic/claude-sonnet-4-5"
)

Development

Clone the repository and install development dependencies:

git clone https://github.com/meridianlabs-ai/inspect_harbor.git
cd inspect_harbor
make install  # Installs dependencies and sets up pre-commit hooks

Run tests and checks:

make check    # Run linting (ruff check + format) and type checking (pyright)
make test     # Run tests
make cov      # Run tests with coverage report

Clean up build artifacts:

make clean    # Remove cache and build artifacts

Release history

0.5.2

May 12, 2026

0.5.1

May 8, 2026

0.5.0

May 7, 2026

0.4.17

Apr 21, 2026

0.4.16

Apr 8, 2026

0.4.15

Apr 1, 2026

0.4.14

Mar 31, 2026

0.4.13

Mar 25, 2026

0.4.12

Mar 24, 2026

0.4.11

Mar 20, 2026

0.4.10

Mar 16, 2026

0.4.9

Mar 11, 2026

0.4.8

Mar 5, 2026

0.4.7

Mar 3, 2026

0.4.6

Mar 2, 2026

0.4.5

Feb 25, 2026

0.4.4

Feb 23, 2026

0.4.3

Feb 18, 2026

0.4.2

Feb 14, 2026

0.4.1

Feb 12, 2026

0.4.0

Feb 12, 2026

This version

0.3.0

Feb 11, 2026

0.2.0

Feb 10, 2026

0.1.2.dev1 pre-release

Feb 10, 2026

0.1.1

Feb 10, 2026

0.1.1.dev2 pre-release

Feb 10, 2026

0.1.0

Feb 10, 2026

Download files

Source Distribution

inspect_harbor-0.3.0.tar.gz (16.3 kB)

Uploaded Feb 11, 2026 Source

Built Distribution

inspect_harbor-0.3.0-py3-none-any.whl (17.5 kB)

Uploaded Feb 11, 2026 Python 3

File details

Details for the file inspect_harbor-0.3.0.tar.gz.

File metadata

Download URL: inspect_harbor-0.3.0.tar.gz
Upload date: Feb 11, 2026
Size: 16.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for inspect_harbor-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`5626131281c1dfd7121b4d49e434345a26cebae5dee57dad9e7e1ab5229f07ec`
MD5	`e3b4321abf07bc1eaffea5cd5191663b`
BLAKE2b-256	`d511a8e5e3851a6023ce936d6bbd695d15cc0c0c28e06c7dbff9df2f640bc830`

Provenance

The following attestation bundles were made for inspect_harbor-0.3.0.tar.gz:

Publisher: release.yaml on meridianlabs-ai/inspect_harbor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inspect_harbor-0.3.0.tar.gz
- Subject digest: 5626131281c1dfd7121b4d49e434345a26cebae5dee57dad9e7e1ab5229f07ec
- Sigstore transparency entry: 941941374
- Sigstore integration time: Feb 11, 2026
Source repository:
- Permalink: meridianlabs-ai/inspect_harbor@b51cbe97f5829c1bacd576701611788f784203b4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/meridianlabs-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@b51cbe97f5829c1bacd576701611788f784203b4
- Trigger Event: push

File details

Details for the file inspect_harbor-0.3.0-py3-none-any.whl.

File metadata

Download URL: inspect_harbor-0.3.0-py3-none-any.whl
Upload date: Feb 11, 2026
Size: 17.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for inspect_harbor-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c6b668fbab9f87c460f52edbd4560628af43a2c5b47d1fc5c6bde8180f91da33`
MD5	`0525b8948c01357cd39a3d27db740ae0`
BLAKE2b-256	`32784a9db3a09afd8fa3df3246e6104bb92c80187cd088a3db47f371f38cdfe2`

Provenance

The following attestation bundles were made for inspect_harbor-0.3.0-py3-none-any.whl:

Publisher: release.yaml on meridianlabs-ai/inspect_harbor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inspect_harbor-0.3.0-py3-none-any.whl
- Subject digest: c6b668fbab9f87c460f52edbd4560628af43a2c5b47d1fc5c6bde8180f91da33
- Sigstore transparency entry: 941941421
- Sigstore integration time: Feb 11, 2026
Source repository:
- Permalink: meridianlabs-ai/inspect_harbor@b51cbe97f5829c1bacd576701611788f784203b4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/meridianlabs-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@b51cbe97f5829c1bacd576701611788f784203b4
- Trigger Event: push

inspect-harbor 0.3.0