Run OpenReward environments through the Inspect eval platform

These details have not been verified by PyPI

Project links

Project description

inspect-openreward

Run OpenReward environments as Inspect evals.

Provides an Inspect-native Dataset, Scorer, and a session-lifecycle wrapper solver for any OpenReward environment, so you get Inspect's eval harness, transcript viewer, metrics, and model abstraction for free — and OpenReward's tools, tasks, and rewards surface as first-class Inspect primitives. The wrapper takes an arbitrary inner solver chain, so you can keep the default react-style loop or plug in your own scaffolding (system_message, basic_agent, react, custom @solver, …) without touching session management.

Install

uv venv
uv sync

Set OPENREWARD_API_KEY and whichever model provider keys you need (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.).

Quickstart

See ./src/example/terminal_bench_2_verified.py for a runnable version with both a default, and a customer solver chain, task. ./src/example/terminal_bench_2_verified_example_output.txt contains a sample output from running the example against Kimi K2.6 Reasoning.

Benchmark results

To validate the integration end-to-end, we ran all 87 of the Terminal Bench 2 Verified environment, with Anthropic Sonnet 4.5 driving the Claude Code toolset, scoring 55.0%. This lines up closely with our September 2025 benchmark of the same combination, with the ~5% uplift likely reflecting six months of model and harness improvements.

Custom solver chains

openreward_solver is a wrapper: it owns session open/close, prompt injection, tool conversion + installation, and reward capture, and then runs whatever inner solver chain you hand it. That means any Inspect solver composition slots in — chain together system_message, prompt_template, chain_of_thought, self_critique, use_tools(..., append=True), basic_agent, react, or your own @solver — and the OpenReward session-bound tools remain available throughout:

from inspect_ai.solver import chain, generate, system_message

@task
def terminal_bench_2_verified_custom() -> Task:
    env = OpenReward().environments.get(name="GeneralReasoning/terminal-bench-2-verified")
    return Task(
        dataset=openreward_dataset(env, split="test", limit=10),
        solver=openreward_solver(
            env,
            chain(
                system_message("Think carefully before each tool call."),
                generate(tool_calls="loop"),
            ),
            toolset="claude-code"
        ),
        scorer=openreward_scorer(),
    )

Reward / finished capture is baked into the installed tools, so it keeps working no matter how the inner chain drives tool calls (generate(...), execute_tools(...) inside basic_agent, a custom loop, …).

What each piece does

`openreward_dataset(environment, split, limit=None, shuffle=False, seed=None)`

Builds an Inspect Dataset from an OpenReward environment split. Each Sample carries the underlying OpenReward Task in metadata; the prompt is resolved lazily by the solver (so image prompts work, and there's no network round-trip at dataset-construction time).

`openreward_solver(environment, solver=None, *, toolset=None, tool_choice="auto")`

Session-lifecycle wrapper around an arbitrary inner solver chain. Per sample it:

Opens environment.session(task=..., toolset=toolset).
Fetches session.get_prompt() and injects it as the user message (text and image blocks both supported).
Converts session.list_tools() into Inspect tools, auto-detecting the provider from state.model.api and sanitising the JSON schema via openreward.sanitize_tool_schema, and installs them on state.tools / state.tool_choice.
Runs the inner solver inside the open session. If solver=None (the default), runs generate(tool_calls="loop") — the react-style loop. Pass a Solver or a list[Solver] (composed via chain(...)) to customise.
Captures the terminal reward / finished from tool outputs into state.metadata for the scorer to read — regardless of how the inner chain invokes tools.
Closes the session on teardown.

Arguments

environment: the OpenReward Environment to open sessions against. The dataset should be built from the same environment.
solver: inner solver (or list[Solver], normalised via inspect_ai.solver.chain). Defaults to generate(tool_calls="loop").
toolset (keyword-only): optional OpenReward toolset name passed to environment.session(...) — e.g. "claude-code" for a harness-native bash tool surface in addition to the environment's own tools.
tool_choice (keyword-only): passed through to Inspect's state.tool_choice.

Inner chains can layer on further tools via use_tools(extra_tools, append=True) — the session-bound tools installed by the wrapper remain available.

`openreward_scorer()`

@scorer that reads the terminal reward captured by openreward_solver and returns an Inspect Score. Metrics: mean() and stderr(). Samples that never produced a reward (e.g. ran out of turns) score 0.0.

`openreward_tool_to_inspect(tool_spec, session, provider=None)`

Low-level helper for converting a single OpenReward ToolSpec into an Inspect Tool. The openreward_solver wrapper is built on top of this; use it directly if you want to bypass the wrapper and manage the session yourself.

Model provider mapping

The solver maps Inspect's model-provider identifier to OpenReward's Provider enum for schema sanitisation:

Inspect `state.model.api`	OpenReward provider
`openai`, `openai-api`	`openai`
`anthropic`	`anthropic`
`google`, `vertex`	`google`
`openrouter`	`openrouter`
anything else	`openai` (safest superset)

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.3.0

May 8, 2026

0.2.0

Apr 24, 2026

0.1.1

Apr 24, 2026

0.1.0

Apr 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

inspect_openreward-0.3.0.tar.gz (249.1 kB view details)

Uploaded May 8, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

inspect_openreward-0.3.0-py3-none-any.whl (10.7 kB view details)

Uploaded May 8, 2026 Python 3

File details

Details for the file inspect_openreward-0.3.0.tar.gz.

File metadata

Download URL: inspect_openreward-0.3.0.tar.gz
Upload date: May 8, 2026
Size: 249.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for inspect_openreward-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`86e288ba56509df1af093194f25d6035762506089ac2d57b85599679aa394344`
MD5	`f370805e7bcc068fb040484bdbaca827`
BLAKE2b-256	`5dc6e6ba3b44bb532d82d8c2e49a71932b8fc9cc712249e5c0b79cd2d5c966e4`

See more details on using hashes here.

File details

Details for the file inspect_openreward-0.3.0-py3-none-any.whl.

File metadata

Download URL: inspect_openreward-0.3.0-py3-none-any.whl
Upload date: May 8, 2026
Size: 10.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for inspect_openreward-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b30a21469bcaa40b8d0e368a77270658ccd63e1f19eb94d0b7182cab3d6d8e39`
MD5	`b3ff5469a4c75106d734db40732ae041`
BLAKE2b-256	`e188d59a340599cb5648afa2163064b57c8b41d986855ed0546feb58acaf7b11`

See more details on using hashes here.

inspect-openreward 0.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

inspect-openreward

Install

Quickstart

Benchmark results

Custom solver chains

What each piece does

`openreward_dataset(environment, split, limit=None, shuffle=False, seed=None)`

`openreward_solver(environment, solver=None, *, toolset=None, tool_choice="auto")`

`openreward_scorer()`

`openreward_tool_to_inspect(tool_spec, session, provider=None)`

Model provider mapping

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes