Agent evaluation harness for OpenReward environments
🔥🐴 Firehorse is a library of agent harnesses for running models against OpenReward environments.
It bridges popular harnesses (Claude Code, Codex, Gemini CLI, ReAct, ReSum) with OpenReward, letting you sample agentic trajectories without setting up environment infrastructure. Firehorse manages concurrent trial execution and produces structured trajectory logs and aggregate results.
Quickstart
Install the firehorse library:
pip install firehorse-cli
Set up your environment variables (get an OpenReward API key at openreward.ai):
export OPENREWARD_API_KEY=your-openreward-key
export OPENROUTER_API_KEY=your-openrouter-key # or another key if using a different model provider
Ensure you have the harness CLI installed (in this case Claude Code) and then run:
# Run Claude Code agent against an environment
firehorse \
  --env Eigent/SETA \
  --agent claude-code \
  --model openrouter/moonshotai/kimi-k2.6 \
  --split train \
  --output-dir ./kimi-seta
Prerequisites
- Python 3.10+
- OpenReward API key — get one at openreward.ai
- LLM provider API key — Anthropic, OpenAI, Google, or OpenRouter
For specific agents:
- claude-code — requires Claude Code CLI installed (tested with v2.1.88)
- codex — requires Codex CLI installed (tested with v0.121.0)
- gemini — requires Gemini CLI installed (tested with v0.38.2)
Agent Types
| Agent | Description | Providers |
|---|---|---|
| resum (default) | ReAct loop with compaction when context fills up | Anthropic, OpenAI, Google, OpenRouter, custom |
| claude-code | Claude Code CLI with environment tools via MCP | Anthropic, OpenRouter |
| codex | Codex CLI with environment tools via MCP | OpenAI |
| react | Direct LLM API Reason-Act loop | Anthropic, OpenAI, Google, OpenRouter, custom |
| gemini | Gemini CLI with environment tools via MCP | Google |
Thinking / Reasoning
The --effort flag controls how much thinking/reasoning the model does. It's supported by all agents. Omitting the flag (or passing --effort none) leaves the reasoning parameter unset so each provider uses its own default. The effort level maps to each provider's native thinking mechanism:
| Provider | Mechanism | low | medium | high | max |
|---|---|---|---|---|---|
| Anthropic | Adaptive thinking (effort param) | low | medium | high | max (Opus only) |
| OpenAI | reasoning_effort | low | medium | high | xhigh |
| Google Gemini 3.x | thinking_level | low | medium | high | high |
| OpenRouter | Passes through to underlying provider | — | — | — | — |
# High thinking (opt-in; default is no effort flag — provider picks its own default)
firehorse --env GeneralReasoning/CTF --model anthropic/claude-sonnet-4-6 --effort high
# Max thinking for deep reasoning tasks
firehorse --env Naman/R2E-Gym --agent codex --model openai/gpt-5.4 --split all --effort xhigh
# Low thinking for speed
firehorse --env collinear/YC-Bench --agent react --model google/gemini-3.1-flash-lite-preview --effort low
Each agent maps --effort to its provider's native parameter. Models that don't support thinking (e.g., GPT-4.1) ignore the flag.
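The mapping in the table above can be sketched as a lookup table. This is a hypothetical illustration, not firehorse's internals: the parameter names (`effort`, `reasoning_effort`, `thinking_level`) and level values come from the table, and the `effort_kwargs` helper name is invented here.

```python
# Hypothetical sketch of the --effort -> provider-native parameter mapping
# described in the table above. Not firehorse's actual implementation.
EFFORT_MAP = {
    "anthropic": ("effort", {"low": "low", "medium": "medium", "high": "high", "max": "max"}),
    "openai": ("reasoning_effort", {"low": "low", "medium": "medium", "high": "high", "max": "xhigh"}),
    "google": ("thinking_level", {"low": "low", "medium": "medium", "high": "high", "max": "high"}),
}

def effort_kwargs(provider, effort):
    """Return the provider-native reasoning kwarg for an --effort value.

    Returns {} when effort is None or "none", so the provider's own
    default applies (matching the CLI behavior described above).
    """
    if effort in (None, "none"):
        return {}
    param, levels = EFFORT_MAP[provider]
    return {param: levels[effort]}
```

Note how "max" degrades gracefully per provider: OpenAI's top level is spelled `xhigh`, while Gemini 3.x tops out at `high`.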
CLI Reference
firehorse --env ENV --model MODEL [OPTIONS]
Required:
--env Environment name (e.g. MyOrg/my-environment)
--model Model identifier (e.g. anthropic/claude-sonnet-4-6)
Options:
--agent Agent type: claude-code, codex, gemini, react, resum (default: resum)
--variant Environment variant (e.g. 'mathnocode' for GeneralReasoning/MATH) (default: none)
--split Task split to evaluate (default: test)
--n-concurrent Max parallel trials (default: 1)
--max-tasks Limit number of tasks to evaluate
--max-turns Max tool call turns per trial
--run-name Name for this evaluation run
--effort Thinking effort: none, low, medium, high, max, xhigh (default: none — use model default)
--provider-url Custom API base URL for non-standard endpoints
--output-dir Directory for JSONL trajectory logs and results
--secret KEY=VAL Inject a session secret (repeatable)
--disable-builtin-tools Comma-separated list of tools to disable
--use-env-descriptions Use environment tool descriptions instead of built-in ones
--use-all-filesystem-tools Expose all filesystem tools via MCP (codex only)
--no-logging Disable OpenReward rollout streaming
Output
When --output-dir is specified, firehorse writes:
output_dir/
├── run_result.json # Aggregate results across all trials
├── trial_0.jsonl # Full agent trajectory
├── trial_0_result.json # Per-trial summary (reward, tokens, cost, duration)
├── trial_0_rewards.jsonl # Reward signal at each tool call
└── ...
Each trial result includes:
| Field | Description |
|---|---|
| reward | Final reward score from the environment |
| finished | Whether the environment signaled task completion |
| turns_used | Number of tool call turns |
| input_tokens | Total input tokens consumed |
| output_tokens | Total output tokens consumed |
| cost_usd | Estimated API cost |
| duration_seconds | Wall-clock time |
JSONL Trajectory Format
All agents share the same bookend events in trial_*.jsonl:
{"type": "openreward_prompt", "system_prompt": "...", "environment_prompt": "..."}
... agent-specific events ...
{"type": "openreward_summary", "task_spec": {...}, "env": "...", "model": "...", "usage": {...}}
The events in between depend on whether the agent is API-based or CLI-based.
API agents (react, resum) produce normalized firehorse events:
- assistant — model response with text, reasoning, and tool calls
- tool_call — tool invocation with name, arguments, and call ID (resum only; react embeds tool calls in the assistant event)
- tool_result — tool output with explicit reward and finished fields
CLI agents (claude-code, codex, gemini) pass through the raw CLI stream format:
- claude-code: Claude's stream-json events (assistant/user with message.content blocks)
- codex: Codex's --json events (item.started/item.completed with nested item.type)
- gemini: Gemini's stream-json events (message deltas, tool_use, tool_result)
Reward signals are available in the trial_*_rewards.jsonl sidecar file and in OpenReward rollouts.
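Since only the bookend events are shared across agents, a minimal reader can split a trajectory into prompt, summary, and the agent-specific middle without knowing which agent produced it. A sketch (the `read_trajectory` helper is illustrative, not part of firehorse; it assumes only the event shapes shown above):

```python
import json

def read_trajectory(path):
    """Split a trial_*.jsonl file into (prompt, summary, events).

    Relies only on the shared bookend events documented above
    (openreward_prompt and openreward_summary); everything in
    between is agent-specific and passed through untouched.
    """
    prompt, summary, events = None, None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "openreward_prompt":
                prompt = event
            elif event.get("type") == "openreward_summary":
                summary = event
            else:
                events.append(event)
    return prompt, summary, events
```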
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Documentation
Full documentation is available at docs.openreward.ai.