Repo2RLEnv
Turn any repository into an RL environment for training and evaluation.
⚠️ Experimental. This is a research project in active development. APIs, spec fields, and CLI flags change between minor versions. Pin to a specific release if you depend on it; expect breaking changes on `main`.
Repo2RLEnv synthesizes verifiable data from existing repositories using pluggable pipelines, exports it into a uniform spec, and pushes straight to the Hugging Face Hub. The workflow runs end to end (synthesize → standardize → train + eval), with the focus on training. The uniform spec is Harbor's, so the datasets you produce drop straight into any Harbor-compatible runtime.
╭──────────────╮     ╭──────────────╮     ╭──────────────╮     ╭──────────────────╮
│     any      │ ──▶ │  synthesize  │ ──▶ │ uniform spec │ ──▶ │  train · eval ·  │
│     repo     │     │ (pipelines)  │     │   (Harbor)   │     │  push to HF Hub  │
╰──────────────╯     ╰──────────────╯     ╰──────────────╯     ╰──────────────────╯
└─────────────────────────────────── Repo2RLEnv ──────────────────────────────────┘
Quickstart
# Install (pick one)
uv add repo2rlenv # add to a uv-managed project
uvx repo2rlenv --help # one-shot, no install
pip install repo2rlenv # classic
# Auth: nothing to set up if you've done `gh auth login` and `huggingface-cli login`
# Otherwise: export GITHUB_TOKEN=... ; export HF_TOKEN=...
# Generate a dataset locally
repo2rlenv generate \
  --repo <owner>/<repo> \
  --pipeline pr_diff \
  --pipeline-opt limit=5 \
  --llm anthropic/claude-sonnet-4-6 \
  --out ./datasets/<dataset-name>
# Or push straight to HF Hub with --out hf://<your-org>/<dataset-name>
# Validate a local dataset against the spec
repo2rlenv validate ./path/to/dataset
# Score a candidate diff against a task's oracle (diff-similarity reward)
repo2rlenv reward --task ./datasets/<dataset-name>/<task-id> --prediction ./candidate.diff
# Or write a sample config first and use --config
repo2rlenv init && repo2rlenv generate --config repo2rlenv.config.yaml
Full walkthrough in docs/quickstart.md.
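The `reward` command above scores a candidate diff against a task's stored oracle. As a rough illustration of what a diff-similarity score can look like, the sketch below uses the `ratio()` of Python's `difflib.SequenceMatcher` over the lines of the two diffs. The function name `diff_similarity` and the choice of metric are assumptions made for this example, not a description of Repo2RLEnv's actual reward code.

```python
import difflib


def diff_similarity(oracle: str, prediction: str) -> float:
    """Return a similarity score in [0, 1] between two unified diffs.

    Hypothetical helper: the real reward may normalize or weight diffs differently.
    """
    # Compare the diffs as sequences of lines rather than characters.
    oracle_lines = oracle.strip().splitlines()
    prediction_lines = prediction.strip().splitlines()
    matcher = difflib.SequenceMatcher(
        None, oracle_lines, prediction_lines, autojunk=False
    )
    # ratio() is 2 * matches / total_lines: 1.0 for identical inputs,
    # 0.0 when no line is shared.
    return matcher.ratio()
```

A score like this needs no sandbox, which is why a text-only pipeline such as `pr_diff` can use it directly as a reward.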
Pipelines
Different methods to manufacture verifiable tasks from a repo. Pick one, run it, push the dataset.
| Pipeline | Status | Sandbox | Inspiration | Docs |
|---|---|---|---|---|
| `pr_diff` | ✅ | — | SWE-RL | 📄 |
| `pr_runtime` | ✅ | ✓ | SWE-bench | 📄 |
| `commit_runtime` | planned | ✓ | R2E-Gym SWE-GEN | 📄 |
| `mutation_bugs` | planned | ✓ | SWE-smith | 📄 |
| `code_instruct` | planned | ✓ | Magicoder / OSS-Instruct | 📄 |
| `equivalence_tests` | planned | ✓ | R2E | 📄 |
| `pr_stream` | planned | ✓ | SWE-bench-Live | 📄 |
| `cve_patches` | planned | ✓ | PatchSeeker / CVE-Bench | 📄 |
| `refactor_synthesis` | planned | ✓ | RefactoringMiner | 📄 |
Every pipeline flows through the same QA gate (determinism, oracle consistency, LLM judge, false-negative filter) before tasks are admitted to a dataset. Text-only pipelines skip the heavy QA layers since there's no execution to validate. See docs/pipelines/README.md for the full status table including reward kinds + GPU requirements.
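To make the determinism layer concrete, one simple approach is to rerun a task's verifier several times and admit the task only if every run agrees. The sketch below assumes exactly that; `is_deterministic` and `run_verifier` are hypothetical names, and the real gate may compare richer outcomes than a single hashable value.

```python
from typing import Callable, Hashable


def is_deterministic(run_verifier: Callable[[], Hashable], runs: int = 3) -> bool:
    """Return True if `run_verifier` yields the same outcome on every run."""
    if runs < 2:
        raise ValueError("need at least two runs to compare outcomes")
    # Collect the distinct outcomes; a flaky verifier produces more than one.
    outcomes = {run_verifier() for _ in range(runs)}
    return len(outcomes) == 1
```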
Bootstrap (sandbox-required pipelines)
Pipelines marked with a sandbox ✓ above need a working Docker environment for the target repo before they can run. Repo2RLEnv's bootstrap phase handles this automatically — an LLM agent iterates shell commands inside a fresh Docker container until the repo builds and the test suite collects. The working image is committed, content-addressed, and cached, so the expensive env-construction step runs once per (repo, ref) and every downstream task reuses it. Pure text pipelines (pr_diff) skip it entirely.
You don't normally invoke it directly: `repo2rlenv generate --pipeline pr_runtime ...` auto-triggers a cache lookup and runs bootstrap on a miss. But you can pre-warm it or use it standalone for debugging:
repo2rlenv bootstrap \
  --repo <owner>/<repo> \
  --llm anthropic/claude-sonnet-4-6
Full design + cache layout + cost-tracking + spec extension fields: docs/reference/BOOTSTRAP.md.
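The "build once per (repo, ref), reuse afterwards" behaviour described above is a standard get-or-build cache. The sketch below illustrates the idea with a directory of small files keyed by a hash of the pair. The key scheme, the on-disk layout, and the names `cache_key` and `get_or_build_image` are assumptions for illustration; the real layout is documented in `docs/reference/BOOTSTRAP.md`.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(repo: str, ref: str) -> str:
    """Derive a stable key for a (repo, ref) pair (hypothetical scheme)."""
    return hashlib.sha256(f"{repo}@{ref}".encode("utf-8")).hexdigest()


def get_or_build_image(
    repo: str, ref: str, cache_dir: Path, build: Callable[[], str]
) -> str:
    """Return the cached image digest for (repo, ref), building on a miss."""
    entry = cache_dir / cache_key(repo, ref)
    if entry.exists():
        # Cache hit: skip the expensive environment construction.
        return entry.read_text().strip()
    # Cache miss: run the build once and remember its digest.
    digest = build()
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(digest)
    return digest
```

Here `build` stands in for the LLM-driven bootstrap run; passing it as a callable keeps the caching logic testable without Docker.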
What you get out
A dataset format that:
- Is verifiable: every task carries either an executable test (`test_execution`) or a stored oracle diff (`diff_similarity`); your trainer picks the reward type
- Is content-addressed: `content_hash` over each task; same artifacts ⇒ same hash
- Trains anywhere via Harbor: TRL, SkyRL, Prime-RL, Tinker, Miles, Slime, harbor.rl
- Evaluates with any agent harness: Claude Code, OpenHands, Codex CLI, Gemini CLI, …
- Is language-agnostic by spec: `_runtime` pipelines emit Dockerfile + shell verifier; `_diff` pipelines are pure text and work for any language with no extra config
- Publishes natively to Hugging Face Hub: `--out hf://owner/name` writes a Harbor-compatible `registry.json` so consumers can `harbor download` without any glue
- Supports private repos end-to-end: `gh auth token` resolved automatically; build secrets declared by name; verifier-time secrets forbidden by spec
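The "same artifacts ⇒ same hash" property comes down to hashing a task's files in a canonical order. The sketch below shows one way to get that property with SHA-256. The input shape (a mapping of relative path to bytes) and the length-prefixed encoding are assumptions for this example, not the spec's actual `content_hash` definition.

```python
import hashlib


def content_hash(artifacts: dict[str, bytes]) -> str:
    """Hash a task's artifacts (relative path -> bytes) in a stable order."""
    digest = hashlib.sha256()
    # Sort by path so the result does not depend on insertion order.
    for path in sorted(artifacts):
        data = artifacts[path]
        # Length-prefix each field so different path/content splits cannot collide.
        for field in (path.encode("utf-8"), data):
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    return digest.hexdigest()
```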
Under the hood
Repo2RLEnv emits datasets in the Harbor task format. We don't ship our own sandbox, agent harness, or registry — Harbor already has those. We focus on synthesis: turning a real repo into verifiable, reproducible Harbor tasks. A small [metadata.repo2env] extension inside Harbor's task.toml carries provenance (pipeline name, base commit, PR URL, content hash, reward kinds, etc.).
By targeting Harbor we inherit its full stack: local Docker / Modal / Daytona / E2B / Runloop sandboxes, every major coding-agent harness, parallel execution, the publishing CLI, and downstream hooks for OpenReward (which adds Miles and Slime to the trainer list).
Documentation
Docs are organized into three tiers — see docs/README.md for the index.
- 🚀 `docs/quickstart.md`: install → first dataset → push to Hub, in 10 minutes
- 📖 `docs/pipelines/`: one page per synthesis pipeline (status, when to use, oracle shape, inspiration)
- 📚 Reference contracts and module-level API:
  - `reference/SPEC.md`: input/output contract
  - `reference/API.md`: Python API for `src/repo2rlenv/`
  - `reference/AUTH.md`: GitHub / HF / LLM auth resolution
  - `reference/BOOTSTRAP.md`: LLM-iterated per-repo Docker image
  - `reference/AGENTS.md`: Harbor agent harnesses + RL trace plumbing
- 🛠 `CONTRIBUTING.md`: dev setup, PR conventions, commit style, release flow
- 🧪 `contributing/ADDING_A_PIPELINE.md`: step-by-step cookbook for shipping a new pipeline
Adjacent projects
Beyond the per-pipeline inspirations linked in the table above, Repo2RLEnv builds on or is adjacent to:
- Harbor: the task format + runtime ecosystem we adopt as our output spec
- RepoLaunch (Microsoft): LLM-agent-driven environment setup; our `bootstrap` is an independent reimplementation
- OpenReward: ORS protocol + extra trainer integrations layered above Harbor
- SWE-Gym: RL-environment framing for SWE-bench-style tasks
- SWE-Bench++: four-stage QA pipeline we'll re-implement
- verifiers (Prime Intellect), OpenEnv (Meta + HF): adjacent standardization efforts
Every pipeline that draws from external work carries an Acknowledgment block in its .py file. No code is copied — implementations are independent and licensed Apache-2.0.
Status
Pre-alpha.
- v0.1 shipped on PyPI: `pr_diff` + HF Hub publish + diff-similarity reward, end-to-end on any GitHub repo (public or private).
- v0.2 in main: bootstrap phase (LLM-driven Docker env), unified Rich UI, content-addressed cache, registry-qualified pullable digests.
- v0.3 in main: `pr_runtime` pipeline (sandbox-verified PR mining with `FAIL_TO_PASS`/`PASS_TO_PASS` oracle), auto-triggered bootstrap, structural quality filters (`ci_only_patch`, `no_new_test_funcs`, path-component test classifier), targeted test invocation. 115/115 tests passing.
- v0.4 planned: polyglot log parsers (JS/Go/Rust), parallel per-PR validation, LLM-judged QA gate (SWE-Bench++ four-layer recipe).
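For readers unfamiliar with the `FAIL_TO_PASS`/`PASS_TO_PASS` oracle mentioned for `pr_runtime`, the sketch below shows the usual SWE-bench-style classification: run the tests before and after applying the PR patch, then bucket the tests that pass afterwards. The function name `split_oracle`, the input shape, and the treatment of tests that are missing before the patch are assumptions for this example.

```python
def split_oracle(
    before: dict[str, bool], after: dict[str, bool]
) -> dict[str, list[str]]:
    """Classify tests by pass/fail status before and after applying the PR patch.

    `before` and `after` map test IDs to True (passed) or False (failed).
    """
    # Assumption: a test absent before the patch (newly added) counts as failing.
    fail_to_pass = sorted(
        t for t, ok in after.items() if ok and not before.get(t, False)
    )
    pass_to_pass = sorted(
        t for t, ok in after.items() if ok and before.get(t, False)
    )
    return {"FAIL_TO_PASS": fail_to_pass, "PASS_TO_PASS": pass_to_pass}
```

Tests that fail after the patch land in neither bucket, so they cannot contribute to the oracle.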
License
Apache 2.0 — see LICENSE.