benchmax

Framework-Agnostic RL Environments for LLM Fine-Tuning

benchmax: Framework-Agnostic RL Environments for LLM Fine-Tuning

A lightweight, training-framework agnostic library for defining, running, and parallelizing environments, to fine-tune OSS LLMs with reinforcement learning.

Overview

benchmax comes with:

A collection of ready-to-use reinforcement learning (RL) environments for LLM fine-tuning ranging from multi-hop search to spreadsheet manipulation to CRM agents
An easy way to define, compose, and parallelize your own environments, including by leveraging the existing ecosystem of MCP servers
Built-in integrations with popular RL training libraries (verl, verifiers, etc.). benchmax is trainer-agnostic by design

Define your environment as:

A toolset (LLM calls, external APIs, calculators, MCPs, etc.).
Output parsing logic to extract structured observations.
Reward functions to score model outputs.

Rollout management, parallel execution, and more come out of the box.

⭐ Star our repository to show your support!

💡 Core Features

Built-in examples & templates

Get started with ready-to-use recipes, from Wikipedia search to spreadsheet manipulation. They are easy to copy, customize, and extend. And yes, more are on the way.

Trainer Integrations

Use your own trainer or training framework - no lock-in. benchmax is already integrated into verl and verifiers, with more integrations (SkyRL, etc.) coming soon!

MCP Support

Tap into the growing MCP ecosystem and integrate its servers as tools within your environments.

Parallel execution & state management

Local multi-process pool
State is isolated across rollouts (e.g. files edited on the local filesystem)
Multi-node parallelization (coming soon!)

📘 Quickstart

Example: Math Question Answering with a Calculator MCP

verl is one of the training frameworks benchmax currently integrates with. Use our verl integration to fine-tune a Qwen model with RL to do math using a calculator MCP (https://github.com/githejie/mcp-server-calculator); the training script below is named for Qwen2.5-3B. The environment is defined at benchmax.envs.math.math_env.MathEnv.

Installation

pip install "benchmax[verl]"

* Note that benchmax installs our verl fork (temporary until PR gets merged)

Prepare the dataset

python benchmax/adapters/verl/benchmax_data_process.py \
  --local_dir ~/data/math \
  --dataset_name dawidmt/arithmetic50 \
  --env_path benchmax.envs.math.math_env.MathEnv

Run training

sh examples/verl/run_qwen2.5-3b_benchmax_math.sh

This math environment is just a quick example. Explore some of the more complex environments, such as excel and crm, in benchmax/envs.

🌐 Creating & Training with Environments

What is an environment?

An environment consists of:

A list of tools that an LLM can call
A list of reward functions that evaluate the quality & correctness of the model's final output.

We also support MCP servers natively, allowing you to easily leverage the many servers built by the community.

Pre-built environments

Ready-to-use environments with pre-configured tools and reward functions.

How do I create a custom environment?

With existing MCP Servers

To create a custom environment using an MCP server (like a calculator, browser, or spreadsheet), you can extend LocalMCPEnv. Here's a quick step-by-step guide using benchmax.envs.math.math_env.MathEnv as an example.

1. Define a System Prompt

This prompt guides the LLM’s behavior. It can include any instruction, such as how to format the answer or when to use tools.

SYSTEM_PROMPT = """Please use the tools provided to do any computation.
Write your complete answer on the final line only, within the xml tags <answer></answer>.
"""

2. Configure MCP Server(s)

Define the MCP servers to be launched. You can configure one or more:

MCP_CONFIG = """
{
  "mcpServers": {
    "server-name": {
      "command": "uvx",
      "args": ["mcp_server_calculator"]
    }
  }
}
"""

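Because the configuration is a plain JSON string, a second server can be added as another key under `mcpServers`. The sketch below parses such a string with the standard library and prints the command each entry describes. The `files` entry, its `some_file_server` package, and the `launch_commands` helper are hypothetical placeholders; how benchmax itself consumes the config is not shown here.

```python
import json

# Two-server variant of the config above; the "files" entry is a hypothetical placeholder.
MCP_CONFIG = """
{
  "mcpServers": {
    "calculator": {
      "command": "uvx",
      "args": ["mcp_server_calculator"]
    },
    "files": {
      "command": "uvx",
      "args": ["some_file_server", "--root", "/tmp/workspace"]
    }
  }
}
"""


def launch_commands(config_text: str) -> dict[str, list[str]]:
    """Map each server name to the argv list its entry describes."""
    servers = json.loads(config_text)["mcpServers"]
    return {
        name: [spec["command"], *spec.get("args", [])]
        for name, spec in servers.items()
    }


for name, argv in launch_commands(MCP_CONFIG).items():
    print(name, "->", " ".join(argv))
```

Parsing the string up front like this is also a cheap way to catch JSON typos before any server is launched.
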
3. Write a Reward Function

The reward function evaluates how "correct" the model's output is, based on structured output. Here’s a simple XML-based example. Note that **kwargs contains all the other fields in your dataset, so feel free to use them in reward_func calculations.

import re
from html import unescape

def reward_func(prompt, completion, ground_truth, workspace, **kwargs):
    # Pull the text between <answer> tags; score 0.0 if the tags are missing
    m = re.search(r'<answer>(.*?)</answer>', completion, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return 0.0
    answer_text = unescape(m.group(1)).strip().lower()
    return float(ground_truth.lower() == answer_text)
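As a sketch of how an extra dataset field might flow through `**kwargs`, the variant below accepts numeric answers within a per-example tolerance. The `tolerance` column is a hypothetical dataset field, not something benchmax defines, and `numeric_reward` is an illustrative name.

```python
import re
from html import unescape
from pathlib import Path


def numeric_reward(prompt: str, completion: str, ground_truth: str,
                   workspace: Path, **kwargs) -> float:
    """Score 1.0 when the tagged answer is within `tolerance` of the ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return 0.0
    # `tolerance` is a hypothetical extra dataset column, delivered via **kwargs.
    tolerance = float(kwargs.get("tolerance", 0.0))
    try:
        answer = float(unescape(m.group(1)).strip())
        target = float(ground_truth)
    except ValueError:
        return 0.0  # non-numeric answer or ground truth
    return float(abs(answer - target) <= tolerance)


completion = "12 * 3.5 is 42.\n<answer>42.0</answer>"
print(numeric_reward("", completion, "42", Path("."), tolerance=1e-6))  # 1.0
print(numeric_reward("", "no tags here", "42", Path(".")))              # 0.0
```

Comparing parsed numbers rather than strings means "42" and "42.0" score the same, which exact string matching would not allow.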

4. Define `dataset_preprocess`

If your dataset is not already standardized, implement this method to convert a raw example into a standardized one with:

"prompt": A fully constructed string prompt.
"ground_truth": A known correct output (optional depending on reward).
"init_rollout_args": Arguments needed to initialize a rollout.

Example for our math task:

def dataset_preprocess(self, example: dict) -> StandardizedExample:
    return StandardizedExample(
        prompt=example.get("task", ""),
        ground_truth=example.get("answer", ""),
        init_rollout_args={}
    )

Notes on init_rollout_args

The `init_rollout_args` dictionary is passed from `dataset_preprocess()` to your environment's `init_rollout()` method. It is used to initialize any **per-example files, resources, or execution context** needed before a rollout begins.

Common use cases include:

Input files: For environments that manipulate files like spreadsheets, images, or databases, pass the necessary file paths.
Version control: For code-related tasks, you might pass a commit_id to check out the correct code state.
Task-specific settings: Pass metadata like cell ranges, task IDs, or execution flags.

Example:

# Inside dataset_preprocess
return {
    "prompt": "...",
    "ground_truth": "...",
    "init_rollout_args": {
        "spreadsheet_path": "/path/to/1_001_input.xlsx"
    }
}

Then in your init_rollout() method (with shutil and pathlib.Path imported):

def init_rollout(self, rollout_id: str, **rollout_args):
    spreadsheet_path = rollout_args["spreadsheet_path"]
    workspace = self.get_rollout_workspace(rollout_id)

    # Copy the input file into the rollout's workspace
    shutil.copy(spreadsheet_path, workspace / Path(spreadsheet_path).name)

This pattern ensures each rollout starts with the correct inputs and configuration.
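To try the flow end to end without benchmax installed, here is a minimal stand-in. `ToyEnv` and its temporary-directory workspaces are assumptions made for illustration, not benchmax's implementation; only the `init_rollout` and `get_rollout_workspace` method names come from the snippet above.

```python
import shutil
import tempfile
from pathlib import Path


class ToyEnv:
    """Stand-in for an environment; not benchmax's actual base class."""

    def __init__(self) -> None:
        self._root = Path(tempfile.mkdtemp(prefix="rollouts_"))

    def get_rollout_workspace(self, rollout_id: str) -> Path:
        # One directory per rollout keeps file edits from leaking across rollouts.
        workspace = self._root / rollout_id
        workspace.mkdir(parents=True, exist_ok=True)
        return workspace

    def init_rollout(self, rollout_id: str, **rollout_args) -> None:
        src = Path(rollout_args["spreadsheet_path"])
        shutil.copy(src, self.get_rollout_workspace(rollout_id) / src.name)


# A throwaway input file stands in for a real spreadsheet.
source = Path(tempfile.mkdtemp(prefix="inputs_")) / "1_001_input.xlsx"
source.write_bytes(b"placeholder")

example = {
    "prompt": "...",
    "ground_truth": "...",
    "init_rollout_args": {"spreadsheet_path": str(source)},
}

env = ToyEnv()
for rollout_id in ("rollout-a", "rollout-b"):
    env.init_rollout(rollout_id, **example["init_rollout_args"])

# Editing one rollout's copy leaves the other copy and the source untouched.
(env.get_rollout_workspace("rollout-a") / source.name).write_bytes(b"edited")
print((env.get_rollout_workspace("rollout-b") / source.name).read_bytes())
```

Copying inputs into the workspace, rather than editing them in place, is what lets several rollouts of the same example run side by side.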

5. Extend `LocalMCPEnv`

Now bring everything together into a custom environment class:

from envs.local_mcp_env import LocalMCPEnv
from typing import Any, List

class MathEnv(LocalMCPEnv):
    """Environment for math problems, using local MCP tools."""

    system_prompt: str = SYSTEM_PROMPT
    reward_funcs: List[RewardFunction] = [reward_func]

    def __init__(self, **kwargs):
        super().__init__(MCP_CONFIG)
    
    def dataset_preprocess(self, example: Any) -> StandardizedExample:
        return StandardizedExample(
            prompt=example.get("task", ""),
            ground_truth=example.get("answer", ""),
            init_rollout_args={}
        )

You're done! This environment is now compatible with benchmax and can be plugged into any compatible RL trainer.

Extend BaseEnv

If you don’t need MCP servers, you can build an environment from scratch by extending `BaseEnv` directly. Here's how to make a minimal math environment with a single tool: an arithmetic evaluator.

1. Define the system prompt

This helps instruct the model on how to interact with the tool and format output.

SYSTEM_PROMPT = """Use the `evaluate` tool to perform any computation.
Write your final answer on the last line inside <answer>...</answer>.
"""

2. Create a reward function

We'll score the model 1.0 if it places the correct answer inside <answer>...</answer> tags:

import re
from html import unescape
from pathlib import Path

def reward_func(prompt: str, completion: str, ground_truth: str, workspace: Path, **kwargs) -> float:
    m = re.search(r'<answer>(.*?)</answer>', completion, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return 0.0
    answer_text = unescape(m.group(1)).strip().lower()
    return float(answer_text == ground_truth.lower())

3. Define your math tool

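A simple restricted eval for math expressions. Note that calling eval with empty builtins is not a true sandbox, so treat this as a demo and use a real expression parser for untrusted input: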

def evaluate_expression(expr: str) -> str:
    try:
        result = eval(expr, {"__builtins__": {}})
        return str(result)
    except Exception as e:
        return f"Error: {str(e)}"
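If the expressions come from a model rather than from you, a stricter evaluator is worth considering. The sketch below is an alternative to the `eval`-based tool, not benchmax's own code: it walks the parsed syntax tree and allows only numeric literals and a fixed set of arithmetic operators. The name `evaluate_expression_ast` is illustrative.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST):
    # Numeric literals only (bool is excluded even though it subclasses int).
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_expression_ast(expr: str) -> str:
    """Evaluate basic arithmetic; return an error string for anything else."""
    try:
        tree = ast.parse(expr, mode="eval")
        return str(_eval_node(tree.body))
    except Exception as e:
        return f"Error: {e}"


print(evaluate_expression_ast("2 + 3 * 4"))                  # 14
print(evaluate_expression_ast("__import__('os').getcwd()"))  # Error: unsupported expression element: Call
```

Exponentiation is left out on purpose, since unbounded powers can be expensive to compute; add it with an explicit bound if your tasks need it.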

4. Create the environment class

Bring it all together in a subclass of BaseEnv:

from typing import Any, Callable, Dict, List, Tuple

class SimpleMathEnv(BaseEnv):
    system_prompt: str = SYSTEM_PROMPT
    _reward_funcs: List[RewardFunction] = [reward_func]

    def __init__(self):
        eval_tool = ToolDefinition(
            name="evaluate",
            description="Safely evaluate a math expression like '2 + 3 * 4'.",
            input_schema={
                "type": "object",
                "properties": {
                    "expr": {
                        "type": "string",
                        "description": "Math expression to evaluate.",
                    },
                },
                "required": ["expr"],
            }
        )
        self.tools: Dict[str, Tuple[ToolDefinition, Callable]] = {
            "evaluate": (eval_tool, evaluate_expression)
        }

    def dataset_preprocess(self, example: dict) -> StandardizedExample:
        return {
            "prompt": f"Question: {example['question']}\n\nWrite your answer below.",
            "ground_truth": example.get("answer", ""),
            "init_rollout_args": {}
        }

    def list_tools(self) -> List[ToolDefinition]:
        return [tool_def for tool_def, _ in self.tools.values()]

    def run_tool(self, rollout_id: str, tool_name: str, **tool_args) -> Any:
        _, tool_fn = self.tools[tool_name]
        return tool_fn(**tool_args)
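The sketch below shows how a caller might exercise `list_tools` and `run_tool`. It uses stand-ins so that it runs without benchmax: the `ToolDefinition` dataclass, `ToyToolEnv`, the `add` tool, and the shape of the `call` dictionary are all assumptions for illustration. It also returns an error string for unknown tool names instead of raising `KeyError`, which is one possible design choice, not benchmax's documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ToolDefinition:
    """Stand-in with the three fields used above; not benchmax's own class."""
    name: str
    description: str
    input_schema: dict


def add(a: float, b: float) -> str:
    return str(a + b)


class ToyToolEnv:
    """Mimics only the list_tools/run_tool surface shown above."""

    def __init__(self) -> None:
        add_tool = ToolDefinition(
            name="add",
            description="Add two numbers.",
            input_schema={
                "type": "object",
                "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                "required": ["a", "b"],
            },
        )
        self.tools: Dict[str, Tuple[ToolDefinition, Callable[..., Any]]] = {
            "add": (add_tool, add)
        }

    def list_tools(self) -> List[ToolDefinition]:
        return [tool_def for tool_def, _ in self.tools.values()]

    def run_tool(self, rollout_id: str, tool_name: str, **tool_args) -> Any:
        # Report unknown tools as text so a bad tool name does not end the rollout.
        if tool_name not in self.tools:
            return f"Error: unknown tool '{tool_name}'"
        _, tool_fn = self.tools[tool_name]
        return tool_fn(**tool_args)


env = ToyToolEnv()
# A tool call as a trainer might hand it over after parsing model output (format assumed).
call = {"name": "add", "arguments": {"a": 2, "b": 40}}
print([t.name for t in env.list_tools()])                            # ['add']
print(env.run_tool("rollout-0", call["name"], **call["arguments"]))  # 42
```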

How about more complex environments?

Check out our excel spreadsheet RL environment: benchmax.envs.excel.excel_env.ExcelEnv

How do I use an environment with my preferred RL Trainer?

We currently have integrations with both verifiers and verl. More incoming!

benchmax environments with verl

benchmax environments with verifiers

I want a specific environment

Open an issue and tag us & we will look into building you one!

🎯 Motivation

Modularity and Simplicity:

We set out to build a lightweight, modular system for defining RL environments, breaking them down into simple, composable parts: tools, tool output parsing, and reward functions.

The goal is to make it easy for software engineers to build and experiment with RL environments without needing deep RL expertise.

Trainer Integrations:

Many new RL training frameworks have been appearing (e.g., numerous forks of verl), and we expect this to continue. They are often tightly coupled to specific environments, leading to fragmentation and limited compatibility.

We are building benchmax as a standalone library with integrations for these training frameworks, and as an easy way for new frameworks to tap into an existing pool of environments. We're already integrated with verl and verifiers. More integrations (e.g. SkyRL) are coming soon!

Task Recipes and Ideas:

We want benchmax to be a living library of reusable, RL-compatible task recipes, ready to inspire and extend beyond the usual suspects like math and coding. We aim to support more real-world workflows, including open-ended and long-horizon tasks.

Parallelization and Cloud Compatibility:

- Enable efficient parallelization with maintained statefulness between rollouts.
- Facilitate easy deployment and scalability in cloud environments.

MCP as a first class citizen:

There has been an explosion of MCP servers and tools built for use cases ranging from browser use to Excel to game creation. benchmax lets you leverage and compose these existing MCP servers to build environments integrated with real-world systems such as Excel.

🤝 Contributing

We welcome new environment recipes, bug reports, and trainer integrations!

📜 License

