RL training environments with verifiable rewards for coding agents
DeepGym
RL training environments with verifiable rewards for coding agents.
You give it model-generated code. It runs the code in a sandbox, checks it against a verifier, and hands back a score. That score plugs straight into TRL, verl, OpenRLHF, or whatever you're using for GRPO/DAPO/PPO.
Quick Start
pip install deepgym
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='local')
env = load_environment('coin_change')
solution = '''
def coin_change(coins, amount):
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for coin in coins:
        for x in range(coin, amount + 1):
            dp[x] = min(dp[x], dp[x - coin] + 1)
    return dp[amount] if dp[amount] != float('inf') else -1
'''
result = dg.run(env, model_output=solution)
print(f'Score: {result.score}, Passed: {result.passed}')
How it works
prompt --> model --> DeepGym sandbox --> verifier --> score --> training loop
                            |                |
                     Daytona /           JSON protocol:
                     local subprocess    score, passed, reward_components
Model writes code. DeepGym runs it. Verifier checks it. You get back a score and per-test-case breakdown showing exactly which tests passed and which didn't.
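To make the flow above concrete, here is a minimal sketch of the run-then-verify loop using only the standard library. It is not DeepGym's implementation: `score_solution` and its test-case format are hypothetical names made up for this illustration, and a plain subprocess gives no real isolation.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def score_solution(solution_code, func_name, cases):
    """Run each (args, expected) case in a fresh subprocess and score the results.

    Hypothetical helper for illustration only; assumes `cases` is non-empty.
    """
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "solution.py").write_text(solution_code)
        for i, (args, expected) in enumerate(cases):
            # Harness: import the solution, call the function, print JSON.
            harness = (
                "import json, sys\n"
                f"sys.path.insert(0, {tmp!r})\n"
                "import solution\n"
                f"print(json.dumps(solution.{func_name}(*{args!r})))\n"
            )
            try:
                proc = subprocess.run(
                    [sys.executable, "-c", harness],
                    capture_output=True, text=True, timeout=10,
                )
                passed = proc.returncode == 0 and json.loads(proc.stdout) == expected
            except (subprocess.TimeoutExpired, json.JSONDecodeError):
                passed = False
            results.append({"id": f"test_{i}", "passed": passed, "score": float(passed)})
    score = sum(c["score"] for c in results) / len(results)
    return {"score": score, "passed": score == 1.0, "cases": results}


SOLUTION = '''
def coin_change(coins, amount):
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for coin in coins:
        for x in range(coin, amount + 1):
            dp[x] = min(dp[x], dp[x - coin] + 1)
    return dp[amount] if dp[amount] != float('inf') else -1
'''

report = score_solution(
    SOLUTION,
    "coin_change",
    [([[1, 2, 5], 11], 3), ([[2], 3], -1)],
)
print(report["score"], report["passed"])
```

The per-case list is what carries the denser signal: a training loop can look at `report["cases"]` instead of a single pass/fail bit.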
What's in the box
- 24 built-in coding environments (ship with pip install)
- 2,350+ importable benchmarks (HumanEval, MBPP, BigCodeBench, EvalPlus)
- SWE-bench Pro support for repo-level patch RL tasks
- Terminal-Bench 2.0 support for shell/terminal RL tasks
- MixedEnvironment routing for multi-benchmark training in one reward function
- Per-test-case reward traces (not just pass/fail -- you see which tests broke)
- Deterministic seeding (same input, same score, every time)
- Three runtime modes: local subprocess, self-hosted Daytona, cloud Daytona
- Drop-in reward functions for TRL, verl, OpenRLHF
- lm-evaluation-harness task adapter (evaluate with lm_eval --tasks deepgym_*)
- HuggingFace Hub integration (share environments as HF datasets)
- Batch scoring for GRPO (score N completions in parallel)
- Gymnasium-style API if you prefer reset/step/state
Usage
Score a single solution
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='local')
env = load_environment('two_sum')
result = dg.run(env, model_output='def two_sum(nums, target): ...')
print(result.score, result.passed, result.reward_components)
Batch scoring for GRPO
solutions = [model.generate(prompt) for _ in range(8)]
batch = dg.run_batch(env, solutions, max_parallel=8)
scores = [r.score for r in batch.results]
# GRPO advantage: (r - mean) / std
mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
advantages = [(s - mean) / (std + 1e-8) for s in scores]
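The same normalization, wrapped in a function and run on two small groups, shows one property worth knowing: if every completion in a group gets the same score, all advantages are zero and that group contributes no learning signal. `grpo_advantages` is a name chosen for this sketch, not part of DeepGym.

```python
def grpo_advantages(scores, eps=1e-8):
    """Group-normalized advantages: (score - mean) / (std + eps)."""
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return [(s - mean) / (std + eps) for s in scores]


# Half the group passes: advantages are close to +1 and -1.
mixed = grpo_advantages([1.0, 0.0, 1.0, 0.0])

# Everyone passes: the mean equals every score, so all advantages are 0.
uniform = grpo_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

This is one reason per-test-case scores help: partial credit makes identical scores within a group less likely than a strict 0/1 reward does.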
Drop into TRL
from deepgym.integrations.trl import make_trl_reward_fn
from trl import GRPOTrainer
reward_fn = make_trl_reward_fn(env)
trainer = GRPOTrainer(model=model, reward_funcs=[reward_fn])
trainer.train()
Drop into DAPO
from deepgym.integrations.dapo import make_dapo_reward_fn
reward_fn = make_dapo_reward_fn(env)
scores = reward_fn(completions=['def solve(x): return x'])
For verl-style DAPO recipes, DeepGym also exposes thin helpers to generate a reward module and a minimal config snippet:
from deepgym.integrations.dapo import (
    generate_dapo_reward_module,
    generate_dapo_verl_config,
)
reward_module = generate_dapo_reward_module('coin_change')
config_yaml = generate_dapo_verl_config(
    train_files='data/train.parquet',
    reward_module_path='reward_module.py',
)
Train on repo patches with SWE-bench Pro
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='auto')
env = load_environment('swebench_pro')
result = dg.run(
    env,
    model_output='''```diff\n... unified diff ...\n```''',
    repo='owner/repo',
    base_commit='abc123',
    test_patch='diff --git ...',
    fail_to_pass=['tests/test_bug.py::test_fix'],
    pass_to_pass=['tests/test_smoke.py::test_smoke'],
)
print(result.score)
Train on terminal tasks with Terminal-Bench 2.0
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='auto')
env = load_environment('terminal_bench_2')
result = dg.run(
    env,
    model_output='python solve.py --input data.txt > output.txt',
    task_id='regex-log',
)
print(result.score)
Mix multiple benchmarks behind one reward function
from deepgym import MixedEnvironment, load_environment
from deepgym.integrations.trl import make_trl_reward_fn
swe_env = load_environment('swebench_pro')
terminal_env = load_environment('terminal_bench_2')
coding_env = load_environment('coin_change')  # built-in coding environment
mixed = MixedEnvironment([
    (swe_env, 0.6),
    (terminal_env, 0.2),
    (coding_env, 0.2),
])
reward_fn = make_trl_reward_fn(mixed)
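How MixedEnvironment routes between environments is not described here. One plausible reading of the weights is that they are sampling probabilities, and the sketch below illustrates only that reading. `pick_environment` is a hypothetical helper and strings stand in for environment objects.

```python
import random
from collections import Counter


def pick_environment(weighted_envs, rng):
    """Pick one environment with probability proportional to its weight.

    Assumption: weights act as sampling probabilities. This is an
    illustration, not a description of MixedEnvironment internals.
    """
    envs = [env for env, _ in weighted_envs]
    weights = [weight for _, weight in weighted_envs]
    return rng.choices(envs, weights=weights, k=1)[0]


mix = [("swebench_pro", 0.6), ("terminal_bench_2", 0.2), ("coin_change", 0.2)]
rng = random.Random(0)  # seeded so the routing is reproducible
counts = Counter(pick_environment(mix, rng) for _ in range(10_000))
shares = {name: n / 10_000 for name, n in counts.items()}
print(shares)
```

Seeding the generator keeps the routing reproducible from run to run, in the same spirit as the deterministic scoring.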
Drop into verl
from deepgym.integrations.verl import make_verl_compute_score
compute_score = make_verl_compute_score(env)
# In verl config: custom_reward_function.path = "your_reward_module.py"
Drop into OpenRLHF
from fastapi import FastAPI
from deepgym.integrations.openrlhf import create_openrlhf_router
app = FastAPI()
app.include_router(create_openrlhf_router(env, dg))
# Run with: uvicorn app:app --port 8000
Use with lm-evaluation-harness
from deepgym.integrations.lm_eval import register_deepgym_tasks
register_deepgym_tasks() # registers deepgym_* tasks
# lm_eval --model hf --model_args pretrained=Qwen/Qwen2-0.5B-Instruct \
# --tasks deepgym_coin_change,deepgym_two_sum
Share environments on HuggingFace Hub
from deepgym.integrations.hf import push_environment_to_hub, load_environment_from_hub
# Push to HF Hub
push_environment_to_hub(env, repo_id='your-org/deepgym-coin-change', env_name='coin_change')
# Load from anywhere
env = load_environment_from_hub('your-org/deepgym-coin-change')
Write your own verifier
from deepgym import DeepGym, Environment
dg = DeepGym(mode='local')
env = Environment(
    task='Write a function `add(a, b)` that returns the sum of two numbers.',
    verifier_code=(
        'import importlib.util, sys\n'
        'spec = importlib.util.spec_from_file_location("sol", solution_path)\n'
        'mod = importlib.util.module_from_spec(spec)\n'
        'spec.loader.exec_module(mod)\n'
        'return 1.0 if hasattr(mod, "add") and mod.add(2, 3) == 5 else 0.0\n'
    ),
)
result = dg.run(env, model_output='def add(a, b):\n return a + b\n')
The verifier_code string becomes the body of a function that gets
(solution_path, test_cases_path=None). Return a float, bool, or dict.
The wrapper handles the rest.
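As a rough mental model of that wrapping, the sketch below turns a verifier body into a function with the stated signature and normalizes a float, bool, or dict return value. `build_verifier` and `normalize` are hypothetical names, and the rule that a float counts as passed only at 1.0 is an assumption. DeepGym's real wrapper may differ.

```python
import os
import tempfile
import textwrap


def build_verifier(verifier_body):
    """Wrap a body string into verify(solution_path, test_cases_path=None)."""
    source = (
        "def verify(solution_path, test_cases_path=None):\n"
        + textwrap.indent(verifier_body, "    ")
    )
    namespace = {}
    exec(source, namespace)  # fine for a demo; never do this with untrusted input
    return namespace["verify"]


def normalize(result):
    """Coerce a float, bool, or dict return value into a result dict."""
    if isinstance(result, dict):
        return result
    if isinstance(result, bool):  # check bool first: bool is a subclass of int
        return {"score": 1.0 if result else 0.0, "passed": result}
    score = float(result)
    return {"score": score, "passed": score >= 1.0}  # assumed pass threshold


body = (
    'import importlib.util\n'
    'spec = importlib.util.spec_from_file_location("sol", solution_path)\n'
    'mod = importlib.util.module_from_spec(spec)\n'
    'spec.loader.exec_module(mod)\n'
    'return 1.0 if hasattr(mod, "add") and mod.add(2, 3) == 5 else 0.0\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "solution.py")
    with open(path, "w") as f:
        f.write("def add(a, b):\n    return a + b\n")
    outcome = normalize(build_verifier(body)(path))

print(outcome)
```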
Environments
Built-in (24, ship with pip install)
Coding (20):
- Array/String: reverse_string, palindrome_check, anagram_check, max_subarray, rotate_array, remove_duplicates, valid_parentheses
- Hash Map: group_anagrams, longest_consecutive, top_k_frequent
- DP: climbing_stairs, coin_change, longest_common_subsequence, house_robber
- Graph/Tree: binary_search, merge_intervals, level_order_traversal
- Practical: fizzbuzz, roman_to_integer, matrix_spiral
Computer-use (2): file_organizer, cli_task
Tool-use (2): api_request, data_pipeline
Load by name:
from deepgym import load_environment
env = load_environment('coin_change')
Importable benchmarks
Run the import scripts to pull in standard benchmarks:
python scripts/import_humaneval.py # 164 problems
python scripts/import_evalplus.py # HumanEval+ (80x more tests) + MBPP+
python scripts/import_mbpp.py # 500 problems
python scripts/import_bigcodebench.py # 1,140 problems
After import, they're available through load_environment().
Benchmark-backed special environments
These names resolve directly through load_environment() and use custom execution paths:
- swebench_pro: repo clone -> checkout -> patch apply -> test run -> score by pass fraction
- terminal_bench_2: execute terminal commands in a task sandbox -> verify expected output/state
MixedEnvironment lets you combine these with built-in or imported coding environments while keeping the same reward-function surface.
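For intuition on "score by pass fraction", here is a small sketch that scores a patch by the share of required tests that pass. It assumes the fail_to_pass and pass_to_pass tests are pooled and weighted equally, which this README does not confirm. `pass_fraction` is a hypothetical helper.

```python
def pass_fraction(test_outcomes, fail_to_pass, pass_to_pass):
    """Fraction of required tests that passed after applying the patch.

    Assumption: both lists are pooled with equal weight, and a test that
    did not run counts as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    if not required:
        return 0.0
    passed = sum(1 for test_id in required if test_outcomes.get(test_id, False))
    return passed / len(required)


outcomes = {
    "tests/test_bug.py::test_fix": True,      # the bug is fixed
    "tests/test_smoke.py::test_smoke": False,  # but the patch broke a smoke test
}
score = pass_fraction(
    outcomes,
    fail_to_pass=["tests/test_bug.py::test_fix"],
    pass_to_pass=["tests/test_smoke.py::test_smoke"],
)
print(score)
```

Under this pooling, a patch that fixes the target test but breaks an existing one gets partial credit instead of a flat zero.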
Verifier protocol
Verifiers output JSON to stdout:
{
  "schema_version": "1.0",
  "score": 0.85,
  "passed": true,
  "details": "12/14 tests passed",
  "cases": [
    {"id": "test_0", "passed": true, "score": 1.0, "input_summary": "coins=[1,2,5] amount=11"},
    {"id": "test_1", "passed": false, "score": 0.0, "error": "expected 3, got -1"}
  ],
  "seed": 42
}
The cases field is the interesting part -- it tells you exactly which tests
passed and failed, so your training loop gets a denser signal than just 0 or 1.
Simple verifiers that return a float or bool get auto-wrapped to this format.
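A consumer of this protocol only needs a JSON parser. The sketch below reads output shaped like the example above and pulls out the failing cases. `summarize` is a hypothetical helper, and treating `cases` and `error` as optional fields is an assumption.

```python
import json

# Abbreviated verifier output, using the field names from the example above.
raw_stdout = """
{
  "schema_version": "1.0",
  "score": 0.85,
  "passed": true,
  "details": "12/14 tests passed",
  "cases": [
    {"id": "test_0", "passed": true, "score": 1.0, "input_summary": "coins=[1,2,5] amount=11"},
    {"id": "test_1", "passed": false, "score": 0.0, "error": "expected 3, got -1"}
  ],
  "seed": 42
}
"""


def summarize(stdout_text):
    """Parse verifier JSON and list the failing cases with their errors."""
    report = json.loads(stdout_text)
    cases = report.get("cases", [])  # assumed optional for simple verifiers
    failed = [(c["id"], c.get("error", "")) for c in cases if not c["passed"]]
    return {"score": report["score"], "passed": report["passed"], "failed": failed}


summary = summarize(raw_stdout)
print(summary["score"], summary["failed"])
```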
Architecture
Training Framework (verl / OpenRLHF / TRL)
        |
        v
DeepGym (environments + verifiers + scoring)
        |
        v
Daytona sandbox / local subprocess
Three modes: local (subprocess, no deps, no isolation), daytona (real container isolation), auto (tries Daytona, falls back to local).
Local mode is fine for development. For anything shared or untrusted, use Daytona.
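The fallback behaviour of auto can be pictured as a small decision function. This is a sketch of the rule as stated above, not DeepGym's code: `resolve_mode` is a hypothetical name, and how Daytona availability is actually detected is left to the caller.

```python
def resolve_mode(requested, daytona_available):
    """Map a requested mode to the backend that would actually run the code.

    Assumption: the caller already knows whether Daytona is reachable.
    """
    if requested not in ("local", "daytona", "auto"):
        raise ValueError(f"unknown mode: {requested!r}")
    if requested == "auto":
        # Prefer container isolation, fall back to an unisolated subprocess.
        return "daytona" if daytona_available else "local"
    if requested == "daytona" and not daytona_available:
        raise RuntimeError("daytona mode requested but Daytona is not available")
    return requested


with_daytona = resolve_mode("auto", daytona_available=True)
without_daytona = resolve_mode("auto", daytona_available=False)
print(with_daytona, without_daytona)
```

Note the consequence of the fallback: with auto, untrusted code can end up running in an unisolated local subprocess when Daytona is unreachable, so request daytona explicitly when isolation matters.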
CLI
deepgym run --task task.md --verifier verifier.py --solution solution.py
# Dev mode
DEEPGYM_NO_AUTH=true deepgym serve --host 127.0.0.1 --port 8000 --allow-local-exec
# Production
DEEPGYM_API_KEY=your-key DAYTONA_API_KEY=your-key deepgym serve --port 8000
The server won't start without auth configured. Set DEEPGYM_API_KEY for
production or DEEPGYM_NO_AUTH=true for local development.
--allow-local-exec is required when running without Daytona.
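The startup rules above combine into a simple check. The sketch below encodes them as stated. `startup_check` is a hypothetical helper, and reading "running without Daytona" as "no DAYTONA_API_KEY set" is an assumption.

```python
def startup_check(env, allow_local_exec):
    """Return (ok, reason) for a server start under the rules described above."""
    no_auth = env.get("DEEPGYM_NO_AUTH", "").lower() == "true"
    has_key = bool(env.get("DEEPGYM_API_KEY"))
    if not (no_auth or has_key):
        return False, "set DEEPGYM_API_KEY or DEEPGYM_NO_AUTH=true"
    # Assumption: no DAYTONA_API_KEY means the server runs without Daytona.
    if not env.get("DAYTONA_API_KEY") and not allow_local_exec:
        return False, "--allow-local-exec is required without Daytona"
    return True, "api_key" if has_key else "no_auth"


dev = startup_check({"DEEPGYM_NO_AUTH": "true"}, allow_local_exec=True)
prod = startup_check(
    {"DEEPGYM_API_KEY": "your-key", "DAYTONA_API_KEY": "your-key"},
    allow_local_exec=False,
)
unconfigured = startup_check({}, allow_local_exec=True)
print(dev, prod, unconfigured)
```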
Development
pip install -e ".[dev]" # install with test deps
pytest # 291 tests
ruff check src/ # lint
ruff format src/ # format
Release
PyPI publishing is tag-driven in GitHub Actions.
git tag v0.3.0
git push origin v0.3.0
Pushing a normal branch commit runs CI only. Pushing a v* tag runs the publish job and uploads the package to PyPI.
Daytona setup
git clone https://github.com/daytonaio/daytona
cd daytona
docker compose -f docker/docker-compose.yaml up -d
# Set DAYTONA_API_URL and DAYTONA_API_KEY
Or use Daytona cloud: get a key from app.daytona.io.
Built with Daytona
DeepGym is part of the Daytona Startup Grid.
We use Daytona to power fast, isolated execution for modern agent training and evaluation workflows.
License
MIT
File details
Details for the file deepgym-0.3.0.tar.gz.
File metadata
- Download URL: deepgym-0.3.0.tar.gz
- Upload date:
- Size: 208.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 976146245bd0f8efa7b41858df7cebb6dbaacdb4896be174aed07ddcfaf4c484 |
| MD5 | 4a9b5caa86fb511c2bff5b9ab791231a |
| BLAKE2b-256 | 098590574ea540c508358718d62efd48feab5a79fea4c5f32a5c3a68f88b1ff7 |
File details
Details for the file deepgym-0.3.0-py3-none-any.whl.
File metadata
- Download URL: deepgym-0.3.0-py3-none-any.whl
- Upload date:
- Size: 197.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 05da4751d0b9920757802ad2181e2ef5af871164cf89ec9dbfdb7f2accaa4b6b |
| MD5 | ffcffa40aee6cc47dfa15b0db3b834e5 |
| BLAKE2b-256 | 83c52ec7afe98e49f8ac018607049d084a6c1e616bc2a031d2dbea0ffd5b7ce2 |