
Agent evaluation harness for OpenReward environments


Firehorse


🔥🐴 Firehorse is a library of agent harnesses for running models against OpenReward environments.

It bridges popular harnesses (Claude Code, Codex, Gemini CLI, Hermes Agent, ReAct, ReSum) with OpenReward, so you can sample agentic trajectories without setting up environment infrastructure. Firehorse manages concurrent trial execution and writes structured trajectory logs and aggregate results.

Note: This is an experimental library testing our new composable toolset features on OpenReward.

Quickstart

Install the firehorse library:

pip install firehorse-cli

Set up your environment variables (get an OpenReward key at openreward.ai):

export OPENREWARD_API_KEY=your-openreward-key
export OPENROUTER_API_KEY=your-openrouter-key # or the key for whichever model provider you use

Ensure you have the harness CLI installed (in this case Claude Code) and then run:

# Run Claude Code agent against an environment
firehorse \
  --env Eigent/SETA \
  --agent claude-code \
  --model openrouter/moonshotai/kimi-k2.6 \
  --split train \
  --output-dir ./kimi-seta
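
Before launching a long run, it can help to confirm that the keys and executables from the steps above are actually visible to your shell. The following is a minimal sketch, not part of Firehorse: the helper name missing_prerequisites is hypothetical, and it assumes the quickstart setup (the two environment variables shown above, plus a firehorse executable on PATH after installation).

```python
import os
import shutil


def missing_prerequisites(
    env=None,
    required_vars=("OPENREWARD_API_KEY", "OPENROUTER_API_KEY"),
    required_clis=("firehorse",),
):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: the variable names come from the quickstart above,
    and the executable name assumes pip put `firehorse` on PATH.
    """
    env = os.environ if env is None else env
    problems = []
    for name in required_vars:
        # Treat an empty string the same as an unset variable.
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    for cli in required_clis:
        if shutil.which(cli) is None:
            problems.append(f"executable {cli!r} was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

If you use a different provider or harness, pass its key name or CLI name through required_vars and required_clis.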

Prerequisites

  • Python 3.10+
  • OpenReward API key — get one at openreward.ai
  • LLM provider API key — Anthropic, OpenAI, Google, or OpenRouter

For specific agents:

  • claude-code — requires Claude Code CLI installed (tested with v2.1.88)
  • codex — requires Codex CLI installed (tested with v0.133.0)
  • gemini — requires Gemini CLI installed (tested with v0.38.2)
  • hermes — requires Hermes Agent CLI installed (tested with v0.15.2). Install with uv tool install hermes-agent.

Agent Types

| Agent | Description | Providers |
| --- | --- | --- |
| resum (default) | ReAct loop with compaction when context fills up | Anthropic, OpenAI, Google, OpenRouter, custom |
| claude-code | Claude Code CLI with environment tools via MCP | Anthropic, OpenAI, Google, OpenRouter |
| codex | Codex CLI with environment tools via MCP | OpenAI |
| gemini | Gemini CLI with environment tools via MCP | Google |
| hermes | Hermes Agent CLI with environment tools via MCP | Anthropic, OpenAI, OpenRouter, custom OpenAI-compatible |
| react | Direct LLM API Reason-Act loop | Anthropic, OpenAI, Google, OpenRouter, custom |

Thinking / Reasoning

The --effort flag controls how much thinking/reasoning the model does. It's supported by all agents. Omitting the flag (or passing --effort none) leaves the reasoning parameter unset so each provider uses its own default. The effort level maps to each provider's native thinking mechanism:

| Provider | Mechanism | low | medium | high | max |
| --- | --- | --- | --- | --- | --- |
| Anthropic | Adaptive thinking (effort param) | low | medium | high | max (Opus only) |
| OpenAI | reasoning_effort | low | medium | high | xhigh |
| Google | Gemini 3.x thinking_level | low | medium | high | high |
| OpenRouter | Passes through to underlying provider | | | | |

# High thinking (opt-in; without --effort the provider picks its own default)
firehorse --env GeneralReasoning/CTF --model anthropic/claude-sonnet-4-6 --effort high

# Max thinking for deep reasoning tasks
firehorse --env Naman/R2E-Gym --agent codex --model openai/gpt-5.4 --split all --effort xhigh

# Low thinking for speed
firehorse --env collinear/YC-Bench --agent react --model google/gemini-3.1-flash-lite-preview --effort low

Each agent maps --effort to its provider's native parameter. Models that don't support thinking (e.g., GPT-4.1) ignore the flag.
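
The table above can be read as a small lookup. The sketch below is illustrative only and is not Firehorse's implementation: the names EFFORT_TABLE and native_effort are hypothetical, treating xhigh as an alias for the max column is an assumption, and returning the level unchanged for OpenRouter is one simple reading of "passes through".

```python
# Per-provider values copied from the effort table above.
EFFORT_TABLE = {
    "anthropic": {"low": "low", "medium": "medium", "high": "high", "max": "max"},
    "openai": {"low": "low", "medium": "medium", "high": "high", "max": "xhigh"},
    "google": {"low": "low", "medium": "medium", "high": "high", "max": "high"},
}


def native_effort(provider, effort):
    """Map an --effort level to a provider-native value (hypothetical helper).

    Returns None when the reasoning parameter should stay unset.
    """
    if effort is None or effort == "none":
        return None  # leave the parameter unset so the provider default applies
    if provider == "openrouter":
        return effort  # assumption: forwarded as-is to the underlying provider
    # Assumption: "xhigh" selects the same column as "max".
    column = "max" if effort == "xhigh" else effort
    try:
        return EFFORT_TABLE[provider][column]
    except KeyError:
        raise ValueError(f"unsupported provider/effort: {provider}/{effort}") from None


if __name__ == "__main__":
    for provider in ("anthropic", "openai", "google"):
        print(provider, native_effort(provider, "max"))
```

Note that the Anthropic max level is listed as Opus only; this sketch does not model per-model restrictions.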

CLI Reference

firehorse --env ENV --model MODEL [OPTIONS]

Required:
  --env              Environment name (e.g. MyOrg/my-environment)
  --model            Model identifier (e.g. anthropic/claude-sonnet-4-6)

Options:
  --agent            Agent type: claude-code, codex, gemini, hermes, react, resum (default: resum)
  --variant          Environment variant (e.g. 'mathnocode' for GeneralReasoning/MATH) (default: none)
  --split            Task split to evaluate (default: test)
  --n-concurrent     Max parallel trials (default: 1)
  --max-tasks        Limit number of tasks to evaluate
  --max-turns        Max tool call turns per trial
  --run-name         Name for this evaluation run
  --effort           Thinking effort: none, low, medium, high, max, xhigh (default: none — use model default)
  --provider-url     Custom API base URL for non-standard endpoints
  --output-dir       Directory for JSONL trajectory logs and results
  --secret KEY=VAL   Inject a session secret (repeatable)
  --disable-builtin-tools  Comma-separated list of tools to disable
  --use-env-descriptions   Use environment tool descriptions instead of built-in ones
  --use-all-filesystem-tools  Expose all filesystem tools via MCP (codex only)
  --no-logging       Disable OpenReward rollout streaming
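
If you launch runs from a script, assembling the argument list programmatically avoids shell-quoting mistakes. This is a minimal sketch using only flags listed above; the function build_firehorse_command and its keyword names are hypothetical and not part of the library.

```python
import shlex


def build_firehorse_command(env, model, *, agent=None, split=None, effort=None,
                            n_concurrent=None, max_tasks=None, output_dir=None,
                            secrets=None):
    """Build an argv list for the firehorse CLI (hypothetical helper)."""
    cmd = ["firehorse", "--env", env, "--model", model]
    optional = [
        ("--agent", agent),
        ("--split", split),
        ("--effort", effort),
        ("--n-concurrent", n_concurrent),
        ("--max-tasks", max_tasks),
        ("--output-dir", output_dir),
    ]
    for flag, value in optional:
        # Omitted options fall back to the CLI defaults documented above.
        if value is not None:
            cmd += [flag, str(value)]
    # --secret is repeatable: emit one KEY=VAL pair per entry.
    for key, value in (secrets or {}).items():
        cmd += ["--secret", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    cmd = build_firehorse_command(
        "Eigent/SETA",
        "openrouter/moonshotai/kimi-k2.6",
        agent="claude-code",
        split="train",
        output_dir="./kimi-seta",
    )
    print(shlex.join(cmd))  # shell-safe rendering for logs
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start the run.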

Output

When --output-dir is specified, firehorse writes:

output_dir/
├── run_result.json          # Aggregate results across all trials
├── trial_0.jsonl            # Full agent trajectory
├── trial_0_result.json      # Per-trial summary (reward, tokens, cost, duration)
├── trial_0_rewards.jsonl    # Reward signal at each tool call
└── ...

Each trial result includes:

| Field | Description |
| --- | --- |
| reward | Final reward score from the environment |
| finished | Whether the environment signaled task completion |
| turns_used | Number of tool call turns |
| input_tokens | Total input tokens consumed |
| output_tokens | Total output tokens consumed |
| cost_usd | Estimated API cost |
| duration_seconds | Wall-clock time |
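
The per-trial files are easy to roll up yourself. The sketch below assumes the fields listed above appear as top-level keys in each trial_*_result.json; that layout is an assumption, and summarize_trials is a hypothetical helper, so inspect one file before relying on it. Missing or null values are counted as zero.

```python
import json
from pathlib import Path


def summarize_trials(output_dir):
    """Aggregate trial_*_result.json files in output_dir (hypothetical helper).

    Assumes reward, finished, cost_usd, input_tokens and output_tokens
    are top-level keys in each per-trial JSON file.
    """
    paths = sorted(Path(output_dir).glob("trial_*_result.json"))
    results = [json.loads(p.read_text(encoding="utf-8")) for p in paths]
    n = len(results)
    if n == 0:
        return {"trials": 0}
    return {
        "trials": n,
        "mean_reward": sum(r.get("reward") or 0.0 for r in results) / n,
        "finished": sum(1 for r in results if r.get("finished")),
        "total_cost_usd": sum(r.get("cost_usd") or 0.0 for r in results),
        "total_tokens": sum(
            (r.get("input_tokens") or 0) + (r.get("output_tokens") or 0)
            for r in results
        ),
    }


if __name__ == "__main__":
    print(summarize_trials("./kimi-seta"))
```

run_result.json already contains aggregate results; a script like this is mainly useful for custom slices, such as comparing several output directories.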

JSONL Trajectory Format

All agents share the same bookend events in trial_*.jsonl:

{"type": "openreward_prompt", "system_prompt": "...", "environment_prompt": "..."}
... agent-specific events ...
{"type": "openreward_summary", "task_spec": {...}, "env": "...", "model": "...", "usage": {...}}

The events in between depend on whether the agent is API-based or CLI-based.

API agents (react, resum) produce normalized firehorse events:

  • assistant — model response with text, reasoning, and tool calls
  • tool_call — tool invocation with name, arguments, and call ID (resum only; react embeds tool calls in the assistant event)
  • tool_result — tool output with explicit reward and finished fields

CLI agents (claude-code, codex, gemini, hermes) pass through the raw CLI stream format:

  • claude-code: Claude's stream-json events (assistant/user with message.content blocks)
  • codex: Codex's --json events (item.started/item.completed with nested item.type)
  • gemini: Gemini's stream-json events (message deltas, tool_use, tool_result)
  • hermes: Hermes runs in -Q quiet mode (no per-turn stream); the full transcript is exported post-hoc via hermes sessions export into trial_*_hermes_session.json.

Reward signals are available in the trial_*_rewards.jsonl sidecar file and in OpenReward rollouts.
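
Because every agent writes the same first and last event, a trajectory file can be sanity-checked without knowing the agent-specific schema. The following sketch assumes each line of trial_*.jsonl is one JSON object with a type key, as in the bookend examples above; the function names are hypothetical.

```python
import json
from collections import Counter


def read_events(path):
    """Load one JSON object per non-empty line of a trial_*.jsonl file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def has_bookends(events):
    """True if the trajectory starts with openreward_prompt and ends with openreward_summary."""
    return (
        bool(events)
        and events[0].get("type") == "openreward_prompt"
        and events[-1].get("type") == "openreward_summary"
    )


def event_type_counts(events):
    """Count events by their type field; events without one are grouped under '<missing>'."""
    return Counter(event.get("type", "<missing>") for event in events)


if __name__ == "__main__":
    events = read_events("./kimi-seta/trial_0.jsonl")
    print("complete trajectory:", has_bookends(events))
    for event_type, count in event_type_counts(events).most_common():
        print(event_type, count)
```

A missing openreward_summary at the end may indicate a trial that did not finish writing its log; this is an inference from the bookend format, not documented behavior.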

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation

Full documentation is available at docs.openreward.ai.
