Agent evaluation harness for OpenReward environments

Firehorse

🔥🐴 Firehorse is a library of agent harnesses for running models against OpenReward environments.

It bridges popular harnesses (Claude Code, Codex, Gemini CLI, Hermes, ReAct, ReSum) with OpenReward, letting you sample agentic trajectories without setting up environment infrastructure. Firehorse manages concurrent trial execution and produces structured trajectory logs and aggregate results.

Note: This is an experimental library testing our new composable toolset features on OpenReward.

Quickstart

Install the firehorse library:

pip install firehorse-cli

Set your environment variables (get an OpenReward API key at openreward.ai):

export OPENREWARD_API_KEY=your-openreward-key
export OPENROUTER_API_KEY=your-openrouter-key  # or the matching variable if you use a different model provider

Ensure you have the harness CLI installed (in this case Claude Code) and then run:

# Run Claude Code agent against an environment
firehorse \
  --env Eigent/SETA \
  --agent claude-code \
  --model openrouter/moonshotai/kimi-k2.6 \
  --split train \
  --output-dir ./kimi-seta

Prerequisites

  • Python 3.10+
  • OpenReward API key — get one at openreward.ai
  • LLM provider API key — Anthropic, OpenAI, Google, or OpenRouter

For specific agents:

  • claude-code — requires Claude Code CLI installed (tested with v2.1.88)
  • codex — requires Codex CLI installed (tested with v0.133.0)
  • gemini — requires Gemini CLI installed (tested with v0.38.2)
  • hermes — requires Hermes Agent CLI installed (tested with v0.15.2). Install with uv tool install hermes-agent.

Agent Types

| Agent | Description | Providers |
|---|---|---|
| resum (default) | ReAct loop with compaction when context fills up | Anthropic, OpenAI, Google, OpenRouter, custom |
| claude-code | Claude Code CLI with environment tools via MCP | Anthropic, OpenAI, Google, OpenRouter |
| codex | Codex CLI with environment tools via MCP | OpenAI |
| gemini | Gemini CLI with environment tools via MCP | Google |
| hermes | Hermes Agent CLI with environment tools via MCP | Anthropic, OpenAI, OpenRouter, custom OpenAI-compatible |
| react | Direct LLM API Reason-Act loop | Anthropic, OpenAI, Google, OpenRouter, custom |

Thinking / Reasoning

The --effort flag controls how much thinking/reasoning the model does. It's supported by all agents. Omitting the flag (or passing --effort none) leaves the reasoning parameter unset so each provider uses its own default. The effort level maps to each provider's native thinking mechanism:

| Provider | Mechanism | low | medium | high | max |
|---|---|---|---|---|---|
| Anthropic | Adaptive thinking (effort param) | low | medium | high | max (Opus only) |
| OpenAI | reasoning_effort | low | medium | high | xhigh |
| Google | Gemini 3.x thinking_level | low | medium | high | high |
| OpenRouter | Passes through to underlying provider | | | | |

# High thinking (opt-in; default is no effort flag — provider picks its own default)
firehorse --env GeneralReasoning/CTF --model anthropic/claude-sonnet-4-6 --effort high

# Max thinking for deep reasoning tasks
firehorse --env Naman/R2E-Gym --agent codex --model openai/gpt-5.4 --split all --effort xhigh

# Low thinking for speed
firehorse --env collinear/YC-Bench --agent react --model google/gemini-3.1-flash-lite-preview --effort low

Each agent maps --effort to its provider's native parameter. Models that don't support thinking (e.g., GPT-4.1) ignore the flag.
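The mapping in the table above can be written down as a small lookup. This is only an illustrative sketch transcribed from the table, not Firehorse's internal code: the names `EFFORT_MAP` and `native_effort` are hypothetical, and OpenRouter is left out because it passes the value through to the underlying provider.

```python
from typing import Optional

# Transcribed from the effort table; keys are --effort values,
# values are what each provider's native parameter would receive.
EFFORT_MAP = {
    "anthropic": {"low": "low", "medium": "medium", "high": "high", "max": "max"},  # max: Opus only
    "openai": {"low": "low", "medium": "medium", "high": "high", "max": "xhigh"},
    "google": {"low": "low", "medium": "medium", "high": "high", "max": "high"},
}


def native_effort(provider: str, effort: Optional[str]) -> Optional[str]:
    """Return the provider-native effort value, or None to leave it unset."""
    # Omitting the flag or passing "none" leaves the parameter unset,
    # so the provider falls back to its own default.
    if effort is None or effort == "none":
        return None
    return EFFORT_MAP[provider][effort]


if __name__ == "__main__":
    for provider in EFFORT_MAP:
        print(provider, native_effort(provider, "max"))
```

Returning `None` for the unset case mirrors the documented behaviour that no reasoning parameter is sent unless you opt in.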

CLI Reference

firehorse --env ENV --model MODEL [OPTIONS]

Required:
  --env              Environment name (e.g. MyOrg/my-environment)
  --model            Model identifier (e.g. anthropic/claude-sonnet-4-6)

Options:
  --agent            Agent type: claude-code, codex, gemini, hermes, react, resum (default: resum)
  --variant          Environment variant (e.g. 'mathnocode' for GeneralReasoning/MATH) (default: none)
  --split            Task split to evaluate (default: test)
  --n-concurrent     Max parallel trials (default: 1)
  --max-tasks        Limit number of tasks to evaluate
  --max-turns        Max tool call turns per trial
  --run-name         Name for this evaluation run
  --effort           Thinking effort: none, low, medium, high, max, xhigh (default: none — use model default)
  --provider-url     Custom API base URL for non-standard endpoints
  --output-dir       Directory for JSONL trajectory logs and results
  --secret KEY=VAL   Inject a session secret (repeatable)
  --disable-builtin-tools  Comma-separated list of tools to disable
  --use-env-descriptions   Use environment tool descriptions instead of built-in ones
  --use-all-filesystem-tools  Expose all filesystem tools via MCP (codex only)
  --no-logging       Disable OpenReward rollout streaming
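If you launch runs from a script, for example to sweep several models, it can help to assemble the argument list programmatically. The sketch below uses only flags listed in the reference above; the helper name `build_firehorse_command` and its keyword arguments are hypothetical and not part of Firehorse.

```python
import shlex
from typing import Dict, List, Optional


def build_firehorse_command(
    env: str,
    model: str,
    *,
    agent: Optional[str] = None,
    split: Optional[str] = None,
    effort: Optional[str] = None,
    n_concurrent: Optional[int] = None,
    output_dir: Optional[str] = None,
    secrets: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build an argv list for the firehorse CLI from documented flags."""
    cmd = ["firehorse", "--env", env, "--model", model]
    optional_flags = [
        ("--agent", agent),
        ("--split", split),
        ("--effort", effort),
        ("--n-concurrent", n_concurrent),
        ("--output-dir", output_dir),
    ]
    for flag, value in optional_flags:
        if value is not None:
            cmd += [flag, str(value)]
    # --secret is repeatable: one KEY=VAL pair per occurrence.
    for key, value in (secrets or {}).items():
        cmd += ["--secret", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    argv = build_firehorse_command(
        "Eigent/SETA",
        "openrouter/moonshotai/kimi-k2.6",
        agent="claude-code",
        split="train",
        output_dir="./kimi-seta",
    )
    # Print a shell-safe form; pass argv to subprocess.run(argv) to execute it.
    print(shlex.join(argv))
```

Keeping the command as a list rather than a string avoids shell-quoting problems when values contain spaces.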

Output

When --output-dir is specified, firehorse writes:

output_dir/
├── run_result.json          # Aggregate results across all trials
├── trial_0.jsonl            # Full agent trajectory
├── trial_0_result.json      # Per-trial summary (reward, tokens, cost, duration)
├── trial_0_rewards.jsonl    # Reward signal at each tool call
└── ...

Each trial result includes:

| Field | Description |
|---|---|
| reward | Final reward score from the environment |
| finished | Whether the environment signaled task completion |
| turns_used | Number of tool call turns |
| input_tokens | Total input tokens consumed |
| output_tokens | Total output tokens consumed |
| cost_usd | Estimated API cost |
| duration_seconds | Wall-clock time |
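As a rough illustration of working with these files, the sketch below aggregates the per-trial summaries in an output directory. It assumes each `trial_*_result.json` is a single JSON object with the field names listed above; the function name `summarize_trials` is hypothetical, and treating a missing or null value as zero is an assumption of this sketch rather than documented behaviour.

```python
import json
from pathlib import Path
from typing import Any, Dict


def summarize_trials(output_dir: str) -> Dict[str, Any]:
    """Aggregate reward, completion, and cost over trial_*_result.json files."""
    results = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(output_dir).glob("trial_*_result.json"))
    ]
    n = len(results)
    if n == 0:
        return {"trials": 0}
    return {
        "trials": n,
        # Assumption: a missing or null field counts as 0.
        "mean_reward": sum((r.get("reward") or 0.0) for r in results) / n,
        "finished": sum(1 for r in results if r.get("finished")),
        "total_cost_usd": sum((r.get("cost_usd") or 0.0) for r in results),
    }


if __name__ == "__main__":
    print(summarize_trials("./kimi-seta"))
```

The `run_result.json` file already holds aggregate results, so a script like this is mainly useful for custom metrics or for combining several runs.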

JSONL Trajectory Format

All agents share the same bookend events in trial_*.jsonl:

{"type": "openreward_prompt", "system_prompt": "...", "environment_prompt": "..."}
... agent-specific events ...
{"type": "openreward_summary", "task_spec": {...}, "env": "...", "model": "...", "usage": {...}}

The events in between depend on whether the agent is API-based or CLI-based.

API agents (react, resum) produce normalized firehorse events:

  • assistant — model response with text, reasoning, and tool calls
  • tool_call — tool invocation with name, arguments, and call ID (resum only; react embeds in the assistant event)
  • tool_result — tool output with explicit reward and finished fields
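A trajectory from an API agent could be read along the following lines. This is a minimal sketch that assumes every line of `trial_*.jsonl` is one JSON object with a `type` key, and that the two bookend events are the first and last lines; the helper names `load_trajectory` and `count_event_types` are hypothetical.

```python
import json
from collections import Counter
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]


def load_trajectory(path: str) -> Tuple[Event, List[Event], Event]:
    """Split a trajectory file into (prompt, middle events, summary)."""
    with open(path, encoding="utf-8") as fh:
        events = [json.loads(line) for line in fh if line.strip()]
    if len(events) < 2:
        raise ValueError("expected at least the two bookend events")
    prompt, summary = events[0], events[-1]
    # Assumption: the bookends are the first and last lines of the file.
    if prompt.get("type") != "openreward_prompt":
        raise ValueError("first event is not openreward_prompt")
    if summary.get("type") != "openreward_summary":
        raise ValueError("last event is not openreward_summary")
    return prompt, events[1:-1], summary


def count_event_types(events: List[Event]) -> Counter:
    """Tally events by their type field (e.g. assistant, tool_call, tool_result)."""
    return Counter(event.get("type", "unknown") for event in events)


if __name__ == "__main__":
    prompt, middle, summary = load_trajectory("./kimi-seta/trial_0.jsonl")
    print(summary.get("model"), dict(count_event_types(middle)))
```

For a truncated or interrupted trial the closing summary may be absent, in which case this sketch raises rather than guessing.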

CLI agents (claude-code, codex, gemini, hermes) pass through the raw CLI stream format:

  • claude-code: Claude's stream-json events (assistant/user with message.content blocks)
  • codex: Codex's --json events (item.started/item.completed with nested item.type)
  • gemini: Gemini's stream-json events (message deltas, tool_use, tool_result)
  • hermes: Hermes runs in -Q quiet mode (no per-turn stream); the full transcript is exported post-hoc via hermes sessions export into trial_*_hermes_session.json.

Reward signals are available in the trial_*_rewards.jsonl sidecar file and in OpenReward rollouts.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation

Full documentation is available at docs.openreward.ai.
