firehorse-cli

Agent evaluation harness for OpenReward environments

Firehorse

🔥🐴 Firehorse is a library of agent harnesses for running models against OpenReward environments.

It bridges popular harnesses (Claude Code, Codex, Gemini CLI, Hermes, ReAct, ReSum) with OpenReward, so you can sample agentic trajectories without setting up environment infrastructure. Firehorse manages concurrent trial execution and produces structured trajectory logs and aggregate results.

Note: This is an experimental library testing our new composable toolset features on OpenReward.

Quickstart

Install the firehorse library:

pip install firehorse-cli

Set up your environment variables (get an OpenReward API key at openreward.ai):

export OPENREWARD_API_KEY=your-openreward-key
export OPENROUTER_API_KEY=your-openrouter-key # or the key variable for your model provider, if not OpenRouter

Ensure you have the harness CLI installed (in this case Claude Code) and then run:

# Run Claude Code agent against an environment
firehorse \
  --env Eigent/SETA \
  --agent claude-code \
  --model openrouter/moonshotai/kimi-k2.6 \
  --split train \
  --output-dir ./kimi-seta
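
Before launching a run, it can help to confirm that the keys are exported and the harness CLI is on your PATH. The following is a minimal sketch and not part of firehorse: the `preflight` helper and the executable names in `AGENT_BINARIES` are assumptions for illustration, so adjust them to match your installs.

```python
import os
import shutil

# Assumed executable names for the CLI-based agents (not confirmed by firehorse).
AGENT_BINARIES = {
    "claude-code": "claude",
    "codex": "codex",
    "gemini": "gemini",
    "hermes": "hermes",
}


def preflight(agent, required_env, environ=None, which=shutil.which):
    """Return a list of problems; an empty list means the run can start."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in required_env:
        if not environ.get(name):
            problems.append(f"environment variable {name} is not set")
    # API agents (react, resum) have no entry here and need no external CLI.
    binary = AGENT_BINARIES.get(agent)
    if binary is not None and which(binary) is None:
        problems.append(f"'{binary}' executable not found on PATH")
    return problems


issues = preflight("claude-code", ["OPENREWARD_API_KEY", "OPENROUTER_API_KEY"])
print("ready" if not issues else "\n".join(issues))
```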

Prerequisites

Python 3.10+
OpenReward API key — get one at openreward.ai
LLM provider API key — Anthropic, OpenAI, Google, or OpenRouter

For specific agents:

claude-code — requires Claude Code CLI installed (tested with v2.1.88)
codex — requires Codex CLI installed (tested with v0.133.0)
gemini — requires Gemini CLI installed (tested with v0.38.2)
hermes — requires Hermes Agent CLI installed (tested with v0.15.2). Install with uv tool install hermes-agent.

Agent Types

Agent	Description	Providers
`resum` (default)	ReAct loop with compaction when context fills up	Anthropic, OpenAI, Google, OpenRouter, custom
`claude-code`	Claude Code CLI with environment tools via MCP	Anthropic, OpenAI, Google, OpenRouter
`codex`	Codex CLI with environment tools via MCP	OpenAI
`gemini`	Gemini CLI with environment tools via MCP	Google
`hermes`	Hermes Agent CLI with environment tools via MCP	Anthropic, OpenAI, OpenRouter, custom OpenAI-compatible
`react`	Direct LLM API Reason-Act loop	Anthropic, OpenAI, Google, OpenRouter, custom
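
The table above can be encoded as a lookup to catch unsupported agent/model pairings before a run starts. This is an illustrative sketch only: `AGENT_PROVIDERS`, `provider_of`, and `is_supported` are hypothetical names, and treating the first path segment of the model identifier as the provider is an assumption based on the examples in this README.

```python
# Provider support per agent, transcribed from the Agent Types table.
AGENT_PROVIDERS = {
    "resum": {"anthropic", "openai", "google", "openrouter", "custom"},
    "claude-code": {"anthropic", "openai", "google", "openrouter"},
    "codex": {"openai"},
    "gemini": {"google"},
    "hermes": {"anthropic", "openai", "openrouter", "custom"},
    "react": {"anthropic", "openai", "google", "openrouter", "custom"},
}


def provider_of(model):
    """Assumption: the provider is the first segment of the model identifier."""
    return model.split("/", 1)[0].lower()


def is_supported(agent, model):
    """Check whether the table lists the model's provider for this agent."""
    return provider_of(model) in AGENT_PROVIDERS[agent]


print(is_supported("codex", "openai/gpt-5.4"))
print(is_supported("gemini", "anthropic/claude-sonnet-4-6"))
```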

Thinking / Reasoning

The --effort flag controls how much thinking/reasoning the model does. It's supported by all agents. Omitting the flag (or passing --effort none) leaves the reasoning parameter unset so each provider uses its own default. The effort level maps to each provider's native thinking mechanism:

Provider	Mechanism	low	medium	high	max
Anthropic	Adaptive thinking (`effort` param)	low	medium	high	max (Opus only)
OpenAI	`reasoning_effort`	low	medium	high	xhigh
Google Gemini 3.x	`thinking_level`	low	medium	high	high
OpenRouter	Passes through to underlying provider	—	—	—	—

# High thinking (opt-in; default is no effort flag — provider picks its own default)
firehorse --env GeneralReasoning/CTF --model anthropic/claude-sonnet-4-6 --effort high

# Max thinking for deep reasoning tasks
firehorse --env Naman/R2E-Gym --agent codex --model openai/gpt-5.4 --split all --effort xhigh

# Low thinking for speed
firehorse --env collinear/YC-Bench --agent react --model google/gemini-3.1-flash-lite-preview --effort low

Models that don't support thinking (e.g., GPT-4.1) ignore the flag.
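
To make the effort table concrete, here is a small sketch of the lookup it describes. It is not firehorse's implementation: `EFFORT_TABLE` and `native_effort` are hypothetical names, and because the table only covers the low, medium, high, and max columns, the sketch rejects anything else rather than guessing.

```python
# Effort levels per provider, transcribed from the table above.
EFFORT_TABLE = {
    "anthropic": {"low": "low", "medium": "medium", "high": "high", "max": "max"},
    "openai": {"low": "low", "medium": "medium", "high": "high", "max": "xhigh"},
    "google": {"low": "low", "medium": "medium", "high": "high", "max": "high"},
}


def native_effort(provider, effort):
    """Return the provider-native value, or None to leave the parameter unset."""
    if effort is None or effort == "none":
        return None  # provider falls back to its own default
    if provider == "openrouter":
        return effort  # passed through to the underlying provider
    try:
        return EFFORT_TABLE[provider][effort]
    except KeyError:
        raise ValueError(f"no documented mapping for {provider!r} / {effort!r}")


print(native_effort("openai", "max"))
print(native_effort("google", "max"))
```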

CLI Reference

firehorse --env ENV --model MODEL [OPTIONS]

Required:
  --env              Environment name (e.g. MyOrg/my-environment)
  --model            Model identifier (e.g. anthropic/claude-sonnet-4-6)

Options:
  --agent            Agent type: claude-code, codex, gemini, hermes, react, resum (default: resum)
  --variant          Environment variant (e.g. 'mathnocode' for GeneralReasoning/MATH) (default: none)
  --split            Task split to evaluate (default: test)
  --n-concurrent     Max parallel trials (default: 1)
  --max-tasks        Limit number of tasks to evaluate
  --max-turns        Max tool call turns per trial
  --run-name         Name for this evaluation run
  --effort           Thinking effort: none, low, medium, high, max, xhigh (default: none — use model default)
  --provider-url     Custom API base URL for non-standard endpoints
  --output-dir       Directory for JSONL trajectory logs and results
  --secret KEY=VAL   Inject a session secret (repeatable)
  --disable-builtin-tools  Comma-separated list of tools to disable
  --use-env-descriptions   Use environment tool descriptions instead of built-in ones
  --use-all-filesystem-tools  Expose all filesystem tools via MCP (codex only)
  --no-logging       Disable OpenReward rollout streaming
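
When scripting many runs, it may be convenient to assemble the command line programmatically. The sketch below uses only flags listed in the reference above; the `build_command` helper itself is hypothetical and simply turns keyword arguments into an argument list suitable for `subprocess.run`.

```python
import shlex


def build_command(env, model, **options):
    """Build a firehorse argv list; option names use underscores for dashes."""
    argv = ["firehorse", "--env", env, "--model", model]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean switch such as --no-logging
        elif value is False or value is None:
            continue  # option not requested
        elif isinstance(value, (list, tuple)):
            for item in value:  # repeatable option such as --secret
                argv += [flag, str(item)]
        else:
            argv += [flag, str(value)]
    return argv


cmd = build_command(
    "Eigent/SETA",
    "openrouter/moonshotai/kimi-k2.6",
    agent="claude-code",
    split="train",
    n_concurrent=4,
    output_dir="./kimi-seta",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues with values such as secrets.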

Output

When --output-dir is specified, firehorse writes:

output_dir/
├── run_result.json          # Aggregate results across all trials
├── trial_0.jsonl            # Full agent trajectory
├── trial_0_result.json      # Per-trial summary (reward, tokens, cost, duration)
├── trial_0_rewards.jsonl    # Reward signal at each tool call
└── ...

Each trial result includes:

Field	Description
`reward`	Final reward score from the environment
`finished`	Whether the environment signaled task completion
`turns_used`	Number of tool call turns
`input_tokens`	Total input tokens consumed
`output_tokens`	Total output tokens generated
`cost_usd`	Estimated API cost
`duration_seconds`	Wall-clock time
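
As an illustration of working with these fields, the sketch below aggregates the per-trial summaries in an output directory. It assumes each trial_N_result.json is a single JSON object keyed by the field names in the table, and that missing or null values can be treated as zero; `summarize` is a hypothetical helper, not part of firehorse.

```python
import json
from pathlib import Path


def summarize(output_dir):
    """Aggregate reward, completion, and cost across trial_*_result.json files."""
    results = []
    for path in sorted(Path(output_dir).glob("trial_*_result.json")):
        with path.open(encoding="utf-8") as fh:
            results.append(json.load(fh))
    n = len(results)
    if n == 0:
        return {"trials": 0}
    return {
        "trials": n,
        # Assumption: a missing or null field counts as zero.
        "mean_reward": sum((r.get("reward") or 0.0) for r in results) / n,
        "finished": sum(1 for r in results if r.get("finished")),
        "total_cost_usd": sum((r.get("cost_usd") or 0.0) for r in results),
    }
```

For example, `summarize("./kimi-seta")` would report on the quickstart run's output directory.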

JSONL Trajectory Format

All agents share the same bookend events in trial_*.jsonl:

{"type": "openreward_prompt", "system_prompt": "...", "environment_prompt": "..."}
... agent-specific events ...
{"type": "openreward_summary", "task_spec": {...}, "env": "...", "model": "...", "usage": {...}}

The events in between depend on whether the agent is API-based or CLI-based.

API agents (react, resum) produce normalized firehorse events:

assistant — model response with text, reasoning, and tool calls
tool_call — tool invocation with name, arguments, and call ID (resum only; react embeds in the assistant event)
tool_result — tool output with explicit reward and finished fields

CLI agents (claude-code, codex, gemini, hermes) pass through the raw CLI stream format:

claude-code: Claude's stream-json events (assistant/user with message.content blocks)
codex: Codex's --json events (item.started/item.completed with nested item.type)
gemini: Gemini's stream-json events (message deltas, tool_use, tool_result)
hermes: Hermes runs in -Q quiet mode (no per-turn stream); the full transcript is exported post-hoc via hermes sessions export into trial_*_hermes_session.json.

Reward signals are available in the trial_*_rewards.jsonl sidecar file and in OpenReward rollouts.
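
A trajectory file can be inspected with a few lines of standard-library code. This sketch assumes every non-empty line of trial_*.jsonl is one JSON object; since CLI agents pass through their raw stream format, events without a `type` key are counted under a placeholder label. The helper names `read_trajectory` and `describe` are hypothetical.

```python
import json
from collections import Counter


def read_trajectory(path):
    """Load one JSON object per non-empty line."""
    events = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events


def describe(events):
    """Count events by type and check for the shared bookend events."""
    types = Counter(e.get("type", "<untyped>") for e in events)
    has_bookends = (
        bool(events)
        and events[0].get("type") == "openreward_prompt"
        and events[-1].get("type") == "openreward_summary"
    )
    return types, has_bookends
```

Calling `describe(read_trajectory("output_dir/trial_0.jsonl"))` gives a quick view of which event types a given agent emitted.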

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation

Full documentation is available at docs.openreward.ai.

Release history

0.1.4 (Jun 16, 2026), this version
0.1.3 (Jun 4, 2026)
0.1.2 (Jun 4, 2026)
0.1.0 (Apr 21, 2026)

