Framework-Agnostic RL Environments for LLM Fine-Tuning
benchmax: Framework-Agnostic RL Environments for LLM Fine-Tuning
A lightweight, training-framework agnostic library for defining, running, and parallelizing environments, to fine-tune OSS LLMs with reinforcement learning.
Overview
benchmax comes with:
- A collection of ready-to-use reinforcement learning (RL) environments for LLM fine-tuning ranging from multi-hop search to spreadsheet manipulation to CRM agents
- An easy way to define, compose, and parallelize your own environments, including by leveraging the existing ecosystem of MCP servers
- Built-in integrations with popular RL training libraries (verl, verifiers, etc.).
benchmax is trainer-agnostic by design
Define your environment as:
- A toolset (LLM calls, external APIs, calculators, MCPs, etc.).
- Output parsing logic to extract structured observations.
- Reward functions to score model outputs.
Rollout management, parallel execution, etc. come out of the box.
⭐ Star our repository to show your support!
💡 Core Features
Built-in examples & templates
Get started with ready-to-use recipes, from Wikipedia search to spreadsheet manipulation. They are easy to copy, customize, and extend. And yes, more are on the way.
Trainer Integrations
Use your own trainer or training framework - no lock-in. benchmax is already integrated into verl and verifiers, with more integrations (SkyRL, etc.) coming soon!
MCP Support
Tap into the growing MCP ecosystem and integrate MCP servers as tools within your environments.
Parallel execution & state management
- Local multi-process pool
- State is isolated across rollouts (e.g. when editing files on the local filesystem)
- Multi-Node Parallelization (Coming soon!)
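To make the state-isolation idea concrete, here is a minimal sketch. It is not benchmax's implementation: it only illustrates giving each rollout a private working directory so that file edits in one rollout cannot leak into another. For brevity it uses a thread pool from the standard library rather than the multi-process pool listed above, and `run_rollout`, `run_all`, and the file name are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_rollout(rollout_id: str) -> str:
    """Hypothetical rollout: writes to a private workspace and reads it back."""
    # Each rollout gets its own directory, removed when the block exits.
    with tempfile.TemporaryDirectory(prefix=f"rollout_{rollout_id}_") as tmp:
        scratch = Path(tmp) / "state.txt"
        scratch.write_text(f"state for {rollout_id}")
        return scratch.read_text()


def run_all(rollout_ids: list[str]) -> dict[str, str]:
    # Threads stand in for the multi-process pool described above.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_rollout, rollout_ids))
    return dict(zip(rollout_ids, results))


if __name__ == "__main__":
    print(run_all(["a", "b", "c"]))
```

Because every rollout only touches its own directory, the rollouts can run concurrently without coordinating with each other.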
📘 Quickstart
Example: Math Question Answering with a Calculator MCP
verl is one of the training frameworks benchmax currently integrates with. Use our verl integration to fine-tune Qwen-3 with RL to do math using a calculator MCP server (https://github.com/githejie/mcp-server-calculator). The environment is defined at benchmax.envs.math.math_env.MathEnv.
1. Installation
pip install benchmax[verl]
Note that benchmax installs our verl fork (temporary until the PR gets merged).
2. Prepare the dataset
python benchmax/adapters/verl/benchmax_data_process.py \
    --local_dir ~/data/math \
    --dataset_name dawidmt/arithmetic50 \
    --env_path benchmax.envs.math.math_env.MathEnv
3. Run training
sh examples/verl/run_qwen2.5-3b_benchmax_math.sh
This math environment is just a quick example. Explore some of the more complex environments, such as excel and crm, in benchmax/envs.
🌐 Creating & Training with Environments
What is an environment?
An environment consists of:
- A list of tools that an LLM can call
- A list of reward functions that evaluate the quality & correctness of the model's final output.
We also support MCP servers natively, allowing you to easily leverage the many servers built by the community.
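As a rough mental model of that definition, the sketch below pairs a dictionary of tools with a list of reward functions. `ToyEnv`, `call_tool`, `score`, and the simplified two-argument reward signature are all hypothetical and are not benchmax classes or APIs; the real base classes are covered further down.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyEnv:
    """Hypothetical stand-in for an environment: tools plus reward functions."""

    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)
    reward_funcs: List[Callable[[str, str], float]] = field(default_factory=list)

    def call_tool(self, name: str, **args) -> str:
        # Look the tool up by name and invoke it with keyword arguments.
        return self.tools[name](**args)

    def score(self, completion: str, ground_truth: str) -> List[float]:
        # Every reward function scores the same final output.
        return [fn(completion, ground_truth) for fn in self.reward_funcs]


def add(a: float, b: float) -> str:
    return str(a + b)


def exact_match(completion: str, ground_truth: str) -> float:
    return float(completion.strip() == ground_truth.strip())


env = ToyEnv(tools={"add": add}, reward_funcs=[exact_match])
tool_output = env.call_tool("add", a=2, b=3)  # the "model" calls a tool
rewards = env.score(tool_output, "5")  # its final output is scored
print(tool_output, rewards)
```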
Pre-built environments
Ready-to-use environments with pre-configured tools and reward functions.
How do I create a custom environment?
With existing MCP Servers
To create a custom environment using an MCP server (like a calculator, browser, or spreadsheet), you can extend LocalMCPEnv. Here's a quick step-by-step guide using benchmax.envs.math.math_env.MathEnv as an example.
1. Define a System Prompt
This prompt guides the LLM’s behavior. It can include any instruction, such as how to format the answer or when to use tools.
SYSTEM_PROMPT = """Please use the tools provided to do any computation.
Write your complete answer on the final line only, within the xml tags <answer></answer>.
"""
2. Configure MCP Server(s)
Define the MCP servers to be launched. You can configure one or more:
MCP_CONFIG = """
{
    "mcpServers": {
        "server-name": {
            "command": "uvx",
            "args": ["mcp_server_calculator"]
        }
    }
}
"""
3. Write a Reward Function
The reward function evaluates how "correct" the model's output is, based on structured output. Here’s a simple XML-based example:
Note that **kwargs contains all the other fields in your dataset, so feel free to use them in reward_func calculations.
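import re
from html import unescape

def reward_func(prompt, completion, ground_truth, workspace, **kwargs):
    # Find the text between <answer> tags (case-insensitive, may span lines)
    m = re.search(r'<answer>(.*?)</answer>', completion, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return 0.0
    # Decode HTML entities and normalize whitespace and case before comparing
    answer_text = unescape(m.group(1)).strip().lower()
    return float(ground_truth.lower() == answer_text)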
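To see how this reward behaves on different outputs, the sketch below repeats the function so the snippet runs on its own and scores a few made-up completions. The sample prompt and completions are illustrative, and `workspace` is given a default of `None` here only so the function can be called without an environment.

```python
import re
from html import unescape


def reward_func(prompt, completion, ground_truth, workspace=None, **kwargs):
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return 0.0
    answer_text = unescape(m.group(1)).strip().lower()
    return float(ground_truth.lower() == answer_text)


# Made-up completions covering the main cases.
cases = {
    "correct": "12 * 12 is 144.\n<answer>144</answer>",
    "padded": "<ANSWER>\n 144 \n</ANSWER>",  # tag case and whitespace are tolerated
    "wrong": "<answer>145</answer>",
    "no_tags": "The answer is 144.",  # right value, but not in the required format
}
scores = {name: reward_func("What is 12 * 12?", text, "144") for name, text in cases.items()}
print(scores)
```

Note that a correct value outside the `<answer>` tags scores 0.0, so the reward also enforces the output format requested in the system prompt.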
4. Define dataset_preprocess
If your dataset is not already standardized, implement this method to convert a raw example into a standardized one with:
- "prompt": A fully constructed string prompt.
- "ground_truth": A known correct output (optional depending on reward).
- "init_rollout_args": Arguments needed to initialize a rollout.
Example for our math task:
def dataset_preprocess(self, example: dict) -> StandardizedExample:
    return StandardizedExample(
        prompt=example.get("task", ""),
        ground_truth=example.get("answer", ""),
        init_rollout_args={}
    )
Notes on init_rollout_args
The `init_rollout_args` dictionary is passed from `dataset_preprocess()` to your environment's `init_rollout()` method. It is used to initialize any **per-example files, resources, or execution context** needed before a rollout begins.
Common use cases include:
- Input files: For environments that manipulate files like spreadsheets, images, or databases, pass the necessary file paths.
- Version control: For code-related tasks, you might pass a commit_id to check out the correct code state.
- Task-specific settings: Pass metadata like cell ranges, task IDs, or execution flags.
Example:
# Inside dataset_preprocess
return {
    "prompt": "...",
    "ground_truth": "...",
    "init_rollout_args": {
        "spreadsheet_path": "/path/to/1_001_input.xlsx"
    }
}
Then in your init_rollout() method:
import shutil
from pathlib import Path

def init_rollout(self, rollout_id: str, **rollout_args):
    spreadsheet_path = rollout_args["spreadsheet_path"]
    workspace = self.get_rollout_workspace(rollout_id)
    # Copy the input file into the rollout's workspace
    shutil.copy(spreadsheet_path, workspace / Path(spreadsheet_path).name)
This pattern ensures each rollout starts with the correct inputs and configuration.
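The following self-contained sketch exercises the same pattern outside of benchmax. `WorkspaceDemo` is a hypothetical stand-in class, and its `get_rollout_workspace` implementation (one directory per rollout under a temporary root) is an assumption made for illustration, not the library's actual behavior.

```python
import shutil
import tempfile
from pathlib import Path


class WorkspaceDemo:
    """Hypothetical stand-in; a real environment would extend LocalMCPEnv or BaseEnv."""

    def __init__(self) -> None:
        self._root = Path(tempfile.mkdtemp(prefix="rollouts_"))

    def get_rollout_workspace(self, rollout_id: str) -> Path:
        # Assumed behavior: one directory per rollout, created on demand.
        workspace = self._root / rollout_id
        workspace.mkdir(parents=True, exist_ok=True)
        return workspace

    def init_rollout(self, rollout_id: str, **rollout_args) -> Path:
        source = Path(rollout_args["spreadsheet_path"])
        target = self.get_rollout_workspace(rollout_id) / source.name
        shutil.copy(source, target)
        return target


# Create a throwaway input file, then initialize two independent rollouts from it.
source_dir = Path(tempfile.mkdtemp())
source_file = source_dir / "1_001_input.xlsx"
source_file.write_bytes(b"original")

env = WorkspaceDemo()
copy_a = env.init_rollout("rollout-a", spreadsheet_path=str(source_file))
copy_b = env.init_rollout("rollout-b", spreadsheet_path=str(source_file))
copy_a.write_bytes(b"edited by rollout a")  # does not affect copy_b or the source
```

Since each rollout works on its own copy, an edit made during one rollout leaves the original input and every other rollout's copy untouched.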
5. Extend LocalMCPEnv
Now bring everything together into a custom environment class:
from envs.local_mcp_env import LocalMCPEnv
from typing import Any, List

class MathEnv(LocalMCPEnv):
    """Environment for math problems, using local MCP tools."""
    system_prompt: str = SYSTEM_PROMPT
    reward_funcs: List[RewardFunction] = [reward_func]

    def __init__(self, **kwargs):
        # Launch the MCP server(s) described in MCP_CONFIG
        super().__init__(MCP_CONFIG)

    def dataset_preprocess(self, example: Any) -> StandardizedExample:
        return StandardizedExample(
            prompt=example.get("task", ""),
            ground_truth=example.get("answer", ""),
            init_rollout_args={}
        )
You're done! This environment is now compatible with benchmax and can be plugged into any compatible RL trainer.
Extend BaseEnv
If you don’t need MCP servers, you can build an environment from scratch by extending `BaseEnv` directly. Here's how to make a minimal math environment with a single tool: an arithmetic evaluator.
1. Define the system prompt
This helps instruct the model on how to interact with the tool and format output.
SYSTEM_PROMPT = """Use the `evaluate` tool to perform any computation.
Write your final answer on the last line inside <answer>...</answer>.
"""
2. Create a reward function
We'll score the model 1.0 if it places the correct answer inside <answer>...</answer> tags:
import re
from html import unescape
from pathlib import Path
def reward_func(prompt: str, completion: str, ground_truth: str, workspace: Path, **kwargs) -> float:
    m = re.search(r'<answer>(.*?)</answer>', completion, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return 0.0
    answer_text = unescape(m.group(1)).strip().lower()
    return float(answer_text == ground_truth.lower())
3. Define your math tool
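A simple evaluator for math expressions. Clearing __builtins__ limits what eval can reach, but it is not a real sandbox, so treat this as illustrative only:
def evaluate_expression(expr: str) -> str:
    try:
        # Evaluate with no built-in functions available
        result = eval(expr, {"__builtins__": {}})
        return str(result)
    except Exception as e:
        return f"Error: {str(e)}"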
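If the tool will see untrusted model output, one stricter alternative is to parse the expression with the standard library's `ast` module and evaluate only whitelisted arithmetic nodes. The sketch below is a swapped-in technique, not what benchmax ships; `evaluate_expression_ast` and `_eval_node` are hypothetical names, and only `+`, `-`, `*`, `/`, and unary signs are supported.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST):
    # Plain numbers only (bool is a subclass of int, so exclude it explicitly).
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


def evaluate_expression_ast(expr: str) -> str:
    """Evaluate basic arithmetic; return an error string for anything else."""
    try:
        tree = ast.parse(expr, mode="eval")
        return str(_eval_node(tree.body))
    except Exception as e:
        return f"Error: {e}"


print(evaluate_expression_ast("2 + 3 * 4"))
print(evaluate_expression_ast("__import__('os')"))
```

Names, attribute access, and function calls never reach an evaluator here, because they are rejected at the node level before anything is computed.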
4. Create the environment class
Bring it all together in a subclass of BaseEnv:
from typing import Any, Callable, Dict, List, Tuple

class SimpleMathEnv(BaseEnv):
    system_prompt: str = SYSTEM_PROMPT
    _reward_funcs: List[RewardFunction] = [reward_func]

    def __init__(self):
        eval_tool = ToolDefinition(
            name="evaluate",
            description="Safely evaluate a math expression like '2 + 3 * 4'.",
            input_schema={
                "type": "object",
                "properties": {
                    "expr": {
                        "type": "string",
                        "description": "Math expression to evaluate.",
                    },
                },
                "required": ["expr"],
            }
        )
        # Map each tool name to its definition and the function that implements it
        self.tools: Dict[str, Tuple[ToolDefinition, Callable]] = {
            "evaluate": (eval_tool, evaluate_expression)
        }

    def dataset_preprocess(self, example: dict) -> StandardizedExample:
        return {
            "prompt": f"Question: {example['question']}\n\nWrite your answer below.",
            "ground_truth": example.get("answer", ""),
            "init_rollout_args": {}
        }

    def list_tools(self) -> List[ToolDefinition]:
        return [tool_def for tool_def, _ in self.tools.values()]

    def run_tool(self, rollout_id: str, tool_name: str, **tool_args) -> Any:
        # Look up the implementation by name and call it with the model's arguments
        _, tool_fn = self.tools[tool_name]
        return tool_fn(**tool_args)
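The `run_tool` above raises if the model asks for a tool that does not exist or passes the wrong arguments. One option, sketched below, is to return such problems as text, in the same spirit as the `Error: ...` strings from `evaluate_expression`. Whether a given trainer prefers error strings over exceptions is an assumption here, and `TOOLS`, `add`, and `run_tool_safely` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict


def add(a: float, b: float) -> str:
    return str(a + b)


# Hypothetical registry mapping tool names to plain Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {"add": add}


def run_tool_safely(tool_name: str, **tool_args: Any) -> Any:
    """Dispatch by name; report problems as text instead of raising."""
    tool_fn = TOOLS.get(tool_name)
    if tool_fn is None:
        return f"Error: unknown tool '{tool_name}'"
    try:
        return tool_fn(**tool_args)
    except TypeError as e:  # e.g. missing or unexpected arguments
        return f"Error: {e}"


print(run_tool_safely("add", a=1, b=2))
print(run_tool_safely("multiply", a=1, b=2))
```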
How about more complex environments?
- Check out our Excel spreadsheet RL environment: benchmax.envs.excel.excel_env.ExcelEnv
How do I use an environment with my preferred RL Trainer?
We currently have integrations with both verifiers and verl. More incoming!
benchmax environments with verl
benchmax environments with verifiers
I want a specific environment
Open an issue and tag us & we will look into building you one!
🎯 Motivation
- Modularity and Simplicity:
  We set out to build a lightweight, modular system for defining RL environments, breaking them down into simple, composable parts: tools, tool output parsing, and reward functions.
  The goal is to make it easy for software engineers to build and experiment with RL environments without needing deep RL expertise.
- Trainer Integrations:
  Many new RL training frameworks have been popping up (e.g., numerous forks of verl), and we expect this to continue. They are often tightly coupled with specific environments, leading to fragmentation and limited compatibility.
  We are building benchmax as a standalone library with integrations to these different training frameworks, and as an easy way for new frameworks to tap into an existing pool of environments. We're already integrated with verl and verifiers. More integrations (e.g. SkyRL) coming soon!
- Task Recipes and Ideas:
  We want benchmax to be a living library of reusable, RL-compatible task recipes, ready to inspire and extend beyond the usual suspects like math and coding. We aim to support more real-world workflows, including open-ended and long-horizon tasks.
- Parallelization and Cloud Compatibility:
  - Enable efficient parallelization of stateful rollouts.
  - Facilitate easy deployment and scalability in cloud environments.
- MCP as a first-class citizen:
  There has been an explosion of MCP servers/tools built for use cases ranging from browser use to Excel to game creation.
  benchmax allows folks to leverage and compose these existing MCP servers to build environments integrated with real-world systems, e.g. Excel.
🤝 Contributing
We welcome new environment recipes, bug reports, and trainer integrations!
📜 License
Apache 2.0 © 2025 CGFT Inc.