Skip to main content

Run OpenReward environments through the Inspect eval platform

Project description

inspect-openreward

PyPI version Python

Run OpenReward environments as Inspect evals.

Provides an Inspect-native Dataset, Scorer, and a session-lifecycle wrapper solver for any OpenReward environment, so you get Inspect's eval harness, transcript viewer, metrics, and model abstraction for free — and OpenReward's tools, tasks, and rewards surface as first-class Inspect primitives. The wrapper takes an arbitrary inner solver chain, so you can keep the default react-style loop or plug in your own scaffolding (system_message, basic_agent, react, custom @solver, …) without touching session management.

Install

uv venv
uv sync

Set OPENREWARD_API_KEY and whichever model provider keys you need (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.).

Quickstart

See ./src/example/terminal_bench_2_verified.py for a runnable version with both a default, and a customer solver chain, task. ./src/example/terminal_bench_2_verified_example_output.txt contains a sample output from running the example against Kimi K2.6 Reasoning.

Benchmark results

To validate the integration end-to-end, we ran all 87 of the Terminal Bench 2 Verified environment, with Anthropic Sonnet 4.5 driving the Claude Code toolset, scoring 55.0%. This lines up closely with our September 2025 benchmark of the same combination, with the ~5% uplift likely reflecting six months of model and harness improvements.

Custom solver chains

openreward_solver is a wrapper: it owns session open/close, prompt injection, tool conversion + installation, and reward capture, and then runs whatever inner solver chain you hand it. That means any Inspect solver composition slots in — chain together system_message, prompt_template, chain_of_thought, self_critique, use_tools(..., append=True), basic_agent, react, or your own @solver — and the OpenReward session-bound tools remain available throughout:

from inspect_ai.solver import chain, generate, system_message

@task
def terminal_bench_2_verified_custom() -> Task:
    env = OpenReward().environments.get(name="GeneralReasoning/terminal-bench-2-verified")
    return Task(
        dataset=openreward_dataset(env, split="test", limit=10),
        solver=openreward_solver(
            env,
            chain(
                system_message("Think carefully before each tool call."),
                generate(tool_calls="loop"),
            ),
            toolset="claude-code"
        ),
        scorer=openreward_scorer(),
    )

Reward / finished capture is baked into the installed tools, so it keeps working no matter how the inner chain drives tool calls (generate(...), execute_tools(...) inside basic_agent, a custom loop, …).

What each piece does

openreward_dataset(environment, split, limit=None, shuffle=False, seed=None)

Builds an Inspect Dataset from an OpenReward environment split. Each Sample carries the underlying OpenReward Task in metadata; the prompt is resolved lazily by the solver (so image prompts work, and there's no network round-trip at dataset-construction time).

openreward_solver(environment, solver=None, *, toolset=None, tool_choice="auto")

Session-lifecycle wrapper around an arbitrary inner solver chain. Per sample it:

  1. Opens environment.session(task=..., toolset=toolset).
  2. Fetches session.get_prompt() and injects it as the user message (text and image blocks both supported).
  3. Converts session.list_tools() into Inspect tools, auto-detecting the provider from state.model.api and sanitising the JSON schema via openreward.sanitize_tool_schema, and installs them on state.tools / state.tool_choice.
  4. Runs the inner solver inside the open session. If solver=None (the default), runs generate(tool_calls="loop") — the react-style loop. Pass a Solver or a list[Solver] (composed via chain(...)) to customise.
  5. Captures the terminal reward / finished from tool outputs into state.metadata for the scorer to read — regardless of how the inner chain invokes tools.
  6. Closes the session on teardown.

Arguments

  • environment: the OpenReward Environment to open sessions against. The dataset should be built from the same environment.
  • solver: inner solver (or list[Solver], normalised via inspect_ai.solver.chain). Defaults to generate(tool_calls="loop").
  • toolset (keyword-only): optional OpenReward toolset name passed to environment.session(...) — e.g. "claude-code" for a harness-native bash tool surface in addition to the environment's own tools.
  • tool_choice (keyword-only): passed through to Inspect's state.tool_choice.

Inner chains can layer on further tools via use_tools(extra_tools, append=True) — the session-bound tools installed by the wrapper remain available.

openreward_scorer()

@scorer that reads the terminal reward captured by openreward_solver and returns an Inspect Score. Metrics: mean() and stderr(). Samples that never produced a reward (e.g. ran out of turns) score 0.0.

openreward_tool_to_inspect(tool_spec, session, provider=None)

Low-level helper for converting a single OpenReward ToolSpec into an Inspect Tool. The openreward_solver wrapper is built on top of this; use it directly if you want to bypass the wrapper and manage the session yourself.

Model provider mapping

The solver maps Inspect's model-provider identifier to OpenReward's Provider enum for schema sanitisation:

Inspect state.model.api OpenReward provider
openai, openai-api openai
anthropic anthropic
google, vertex google
openrouter openrouter
anything else openai (safest superset)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

inspect_openreward-0.3.0.tar.gz (249.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

inspect_openreward-0.3.0-py3-none-any.whl (10.7 kB view details)

Uploaded Python 3

File details

Details for the file inspect_openreward-0.3.0.tar.gz.

File metadata

  • Download URL: inspect_openreward-0.3.0.tar.gz
  • Upload date:
  • Size: 249.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for inspect_openreward-0.3.0.tar.gz
Algorithm Hash digest
SHA256 86e288ba56509df1af093194f25d6035762506089ac2d57b85599679aa394344
MD5 f370805e7bcc068fb040484bdbaca827
BLAKE2b-256 5dc6e6ba3b44bb532d82d8c2e49a71932b8fc9cc712249e5c0b79cd2d5c966e4

See more details on using hashes here.

File details

Details for the file inspect_openreward-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: inspect_openreward-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 10.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for inspect_openreward-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b30a21469bcaa40b8d0e368a77270658ccd63e1f19eb94d0b7182cab3d6d8e39
MD5 b3ff5469a4c75106d734db40732ae041
BLAKE2b-256 e188d59a340599cb5648afa2163064b57c8b41d986855ed0546feb58acaf7b11

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page