rlenv_audit, a skill-based auditing system for Prime Intellect `verifiers` RL environments
rlenv_audit
rlenv_audit audits `verifiers` RL environments from the Prime Intellect Hub before you train on them. A broken reward function doesn't crash; it silently teaches the policy garbage. Point an agent (Claude Code or Codex) at an environment, and it runs six checks and returns a scorecard out of 10 with written feedback on what to improve.
Quickstart
# Install the skills (pick one)
uvx rlenv-audit install-skills
pip install rlenv-audit && rlenv-audit install-skills
Then ask your agent, giving the full environment id (account/name; bare
names like gsm8k are ambiguous on the Hub), your problem statement, and
optionally a model endpoint and the HuggingFace datasets to check
contamination against:
Audit primeintellect/gsm8k. I'm trying to train a grade-school math solver.
Use my vLLM endpoint at http://localhost:8000/v1, model Qwen2.5-7B.
Check contamination against openai/gsm8k.
(in Claude Code or Codex)
Output
The scorecard, one row per check, each scored out of 10, plus one final score and written feedback:
rlenv_audit · primeintellect/gsm8k
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ check ┃ status ┃ score ┃ justification ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ integrity │ PASS │ 9.5 │ loads, reward callable, well-formed │
│ problem_alignment │ PASS │ 9.0 │ dataset/reward match the stated goal │
│ reward_design │ PASS │ 8.8 │ discriminates; matches judgment 18/20 │
│ latency │ N/A │ — │ no endpoint │
│ rollout_quality │ N/A │ — │ no endpoint │
│ contamination │ WARN │ 6.0 │ 3 near-matches with openai/gsm8k test │
└───────────────────┴────────┴───────┴─────────────────────────────────────────┘
overall: WARN rating: 8.7/10
feedback
The environment is solidly built: it loads cleanly, the reward is a real
verifier (boxed-answer extraction + math equivalence, not a stub), and it
discriminates well: correct completions scored 1.0 and every wrong or
malformed probe scored 0.0, matching my own judgment on 18 of 20 cases.
The main thing to improve is contamination: 3 of the sampled training
instances near-match the openai/gsm8k test split you asked me to check, so
benchmark gains may partly be memorization; either dedupe against that test
split or report on a different set. Second, the parser only accepts \boxed{}
answers; consider accepting plain final-line answers too, or the policy gets
zero reward for correct-but-unformatted output early in training.
- Final score: a weighted average out of 10 over the checks that ran (N/A carries no weight). Latency and contamination are weighted 0.5 each; the other four checks are weighted 1.0.
- Feedback: 1 to 3 paragraphs covering what the env does right first, then what to improve, in priority order.
- A `FAIL` on any check fails the audit.
- The full report is also saved to `rlenv_audit_reports/<account>__<name>/report.md` (human-readable) and `report.json` (machine-readable) in your working directory, so you can commit it, share it, or diff it against a re-audit after fixes.
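The final-score rule above can be reproduced by hand. The following is a minimal sketch, not rlenv_audit's own code: the `WEIGHTS` table and the `final_score` helper are illustrative names, and using `None` to mark an N/A check is an assumption. Only the weights and the "N/A carries no weight" rule come from the description above.

```python
from typing import Optional

# Weights as described above: latency and contamination 0.5, the other four 1.0.
WEIGHTS = {
    "integrity": 1.0,
    "problem_alignment": 1.0,
    "reward_design": 1.0,
    "latency": 0.5,
    "rollout_quality": 1.0,
    "contamination": 0.5,
}


def final_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted average over the checks that ran; None marks an N/A check (assumption)."""
    ran = {name: score for name, score in scores.items() if score is not None}
    total_weight = sum(WEIGHTS[name] for name in ran)
    if total_weight == 0:
        return None  # nothing ran, so there is no score to report
    return sum(WEIGHTS[name] * score for name, score in ran.items()) / total_weight


# Scores taken from the sample scorecard above (latency and rollout_quality are N/A).
example = {
    "integrity": 9.5,
    "problem_alignment": 9.0,
    "reward_design": 8.8,
    "latency": None,
    "rollout_quality": None,
    "contamination": 6.0,
}
print(round(final_score(example), 1))  # 8.7
```

With the sample scorecard this gives (9.5 + 9.0 + 8.8 + 0.5 × 6.0) / 3.5 ≈ 8.66, which rounds to the 8.7 shown in the output.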
The six checks
| # | Check | Needs | What it does |
|---|---|---|---|
| 1 | integrity | - | Does it even run and is it shaped right: dataset loads & is well-formed, reward present & callable, follows verifiers conventions, no missing fields / broken imports. |
| 2 | problem-statement alignment | - | Given your problem statement (a required input), judge whether the dataset + reward + prompt actually test that problem. |
| 3 | reward design | - | Stress-tests the reward without the policy: the agent writes ~20 synthetic completions (correct / wrong / edge / format perturbations), scores them through the real reward, and checks (a) the reward varies & discriminates sensibly and (b) each reward matches the agent's own judgment of quality. |
| 4 | latency | model endpoint | How long rollouts take end to end. Reads the shared cached rollouts. |
| 5 | rollout quality | model endpoint | Reads actual rollouts and judges whether the env is set up well in practice: system prompt right, outputs sensible, obvious env-caused failure modes. |
| 6 | contamination | HF dataset ids | Compares the env's dataset against the HuggingFace datasets you name (e.g. openai/gsm8k) and flags matching / near-matching instances. N/A (carries no weight) if you don't provide any. |
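To make check 3 concrete, here is a rough sketch of the kind of probing it describes, under stated assumptions: `toy_reward` is a hypothetical stand-in for an environment's reward (a `\boxed{}`-only exact match), and `PROBES` and `stress_test` are illustrative names, not part of rlenv_audit. In a real audit the agent writes about 20 probes and scores them through the environment's actual reward.

```python
import re


def toy_reward(completion: str, answer: str) -> float:
    """Hypothetical stand-in reward: 1.0 if the last \\boxed{...} equals the answer."""
    found = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if found and found[-1].strip() == answer else 0.0


# Each probe pairs a synthetic completion with the auditor's own judgment
# of the reward it deserves.
PROBES = [
    ("So the total is \\boxed{42}", 1.0),        # correct
    ("So the total is \\boxed{41}", 0.0),        # wrong
    ("The answer is 42", 1.0),                   # correct but unformatted
    ("", 0.0),                                   # empty edge case
    ("\\boxed{42} wait, no: \\boxed{7}", 0.0),   # changes its mind at the end
]


def stress_test(reward, probes, answer):
    """Return (reward varies at all, number of agreements with judgment, total)."""
    rewards = [reward(completion, answer) for completion, _ in probes]
    discriminates = len(set(rewards)) > 1
    agreements = sum(
        1 for got, (_, expected) in zip(rewards, probes) if got == expected
    )
    return discriminates, agreements, len(probes)


discriminates, agreed, total = stress_test(toy_reward, PROBES, "42")
print(discriminates, f"{agreed}/{total}")  # True 4/5
```

The one disagreement is the correct-but-unformatted probe, which is the same kind of finding as the `\boxed{}`-only parser note in the sample feedback.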
Shared rollouts (checks 4 & 5). Both need a model, so rlenv_audit runs rollouts once (8 rollouts over ~20 samples, scored + timed, cached) and both checks read that single cache. No endpoint → 4 & 5 are N/A.
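The run-once, read-twice pattern described above might look roughly like this. It is a sketch only: the JSON cache file, the record fields (`completion`, `reward`, `seconds`), and every function name here are assumptions made for illustration, and `fake_run_rollouts` returns canned records instead of calling a model endpoint.

```python
import json
import statistics
import tempfile
from pathlib import Path

call_count = {"n": 0}  # tracks how often the expensive rollout step actually runs


def fake_run_rollouts():
    """Placeholder for the real model calls; returns canned, pre-scored records."""
    call_count["n"] += 1
    return [
        {"completion": "\\boxed{42}", "reward": 1.0, "seconds": 1.2},
        {"completion": "", "reward": 0.0, "seconds": 0.4},
        {"completion": "\\boxed{7}", "reward": 0.0, "seconds": 2.0},
    ]


def load_or_run_rollouts(cache_path: Path, run_rollouts):
    """Run rollouts once; later callers read the cached JSON instead."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    rollouts = run_rollouts()
    cache_path.write_text(json.dumps(rollouts))
    return rollouts


def latency_check(rollouts):
    """Mean end-to-end time per rollout, in seconds."""
    return statistics.mean(r["seconds"] for r in rollouts)


def rollout_quality_check(rollouts):
    """Count empty completions, one obvious failure mode worth flagging."""
    return sum(1 for r in rollouts if not r["completion"].strip())


cache = Path(tempfile.mkdtemp()) / "rollouts.json"
mean_seconds = latency_check(load_or_run_rollouts(cache, fake_run_rollouts))
empty_outputs = rollout_quality_check(load_or_run_rollouts(cache, fake_run_rollouts))
print(call_count["n"], round(mean_seconds, 2), empty_outputs)  # 1 1.2 1
```

Both checks call `load_or_run_rollouts`, but the rollout function runs only once; the second call is served from the cache file.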
Layout
skills/ the six checks + the env-audit orchestrator (SKILL.md each)
.claude-plugin/ plugin + marketplace manifests (repo doubles as a Claude Code plugin)
rlenv_audit/
adapters/verifiers.py EnvHandle, the only code that touches verifiers
tools.py inspect / score / rollouts / scorecard
sandbox.py Docker isolation (for executing risky completions)
cli.py the rlenv-audit / env-audit CLI (+ install-skills)
REWARD_DESIGN.md the design guide the judgment checks cite
Development
pip install -e ".[dev]" && pytest tests/
License
MIT