rlenv_audit, a skill-based auditing system for Prime Intellect `verifiers` RL environments

rlenv_audit

rlenv_audit audits verifiers RL environments from the Prime Intellect Hub before you train on them. A broken reward function doesn't crash; it silently teaches the policy garbage. Point an agent (Claude Code / Codex) at an environment, and it runs six checks and returns a scorecard out of 10 with written feedback on what to improve.

Quickstart

# Install the skills (pick one)
uvx --python 3.12 rlenv-audit install-skills
pip install rlenv-audit && rlenv-audit install-skills   # needs Python >= 3.11

Why --python 3.12: a Hub env must install into the same interpreter as the audit tool, and envs declare Python floors (most >=3.11, some higher), so a 3.12 venv clears nearly all of them in one go.
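To make the floor argument concrete, here is a minimal sketch of the comparison involved. The env names and floors are hypothetical, and `parse_floor` only understands the plain `>=X.Y` form; real `requires-python` specifiers can be richer than this.

```python
def parse_floor(spec: str) -> tuple[int, ...]:
    """Turn a simple '>=3.11' floor into (3, 11). Other specifier forms are rejected."""
    if not spec.startswith(">="):
        raise ValueError(f"unsupported specifier: {spec!r}")
    return tuple(int(part) for part in spec[2:].strip().split("."))


def clears(interpreter: tuple[int, ...], spec: str) -> bool:
    """True when the interpreter version meets the declared floor."""
    return interpreter >= parse_floor(spec)


# Hypothetical envs and floors, for illustration only.
floors = {"env-a": ">=3.10", "env-b": ">=3.11", "env-c": ">=3.12"}

for version in [(3, 11), (3, 12)]:
    installable = [name for name, spec in floors.items() if clears(version, spec)]
    print(version, installable)
```

In this toy set, a 3.11 interpreter misses the env with the 3.12 floor, while a 3.12 interpreter clears all three.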

Then ask your agent, giving the full environment id (account/name; bare names like gsm8k are ambiguous on the Hub), your problem statement, and optionally a model endpoint and the HuggingFace datasets to check contamination against:

Audit primeintellect/gsm8k. I'm trying to train a grade-school math solver.
Check contamination against openai/gsm8k.

(in Claude Code or Codex)

If a vLLM server is up on the default address (http://localhost:8000/v1), the audit finds it on its own: endpoint and model name are auto-detected, and it tells you what it found. Serving somewhere else? Name it in the prompt, e.g. "Use my vLLM endpoint at http://<host>:<port>/v1, model Qwen2.5-7B." An explicitly named endpoint always wins; with no endpoint given and nothing on the default address, checks 4 & 5 are N/A.
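The precedence described above can be sketched as a small function. This is not the package's actual code: `resolve_endpoint` and the injected `is_reachable` probe are hypothetical names, and the probe is passed in so the sketch runs without a live server.

```python
from typing import Callable, Optional

DEFAULT_ENDPOINT = "http://localhost:8000/v1"  # default address named in the README


def resolve_endpoint(
    explicit: Optional[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the endpoint to use, or None when checks 4 & 5 should be N/A."""
    if explicit:
        # An explicitly named endpoint always wins, without probing the default.
        return explicit
    if is_reachable(DEFAULT_ENDPOINT):
        # Nothing named: fall back to whatever answers on the default address.
        return DEFAULT_ENDPOINT
    # No endpoint given and nothing on the default address.
    return None


# Stub probes stand in for a real HTTP check; the host below is made up.
print(resolve_endpoint("http://gpu-box:9000/v1", lambda url: True))
print(resolve_endpoint(None, lambda url: True))
print(resolve_endpoint(None, lambda url: False))
```

Injecting the probe keeps the decision logic separate from networking, which makes the three cases easy to exercise in isolation.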

Output

The scorecard has one row per check, each scored out of 10, plus one final score and written feedback:

                       rlenv_audit · primeintellect/gsm8k
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ check             ┃ status ┃ score ┃ justification                           ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ integrity         │ PASS   │   9.5 │ loads, reward callable, well-formed     │
│ problem_alignment │ PASS   │   9.0 │ dataset/reward match the stated goal    │
│ reward_design     │ PASS   │   8.8 │ discriminates; matches judgment 18/20   │
│ latency           │ PASS   │   8.5 │ mean 2.1s / p90 4.3s, no errors         │
│ rollout_quality   │ PASS   │   8.0 │ prompt clear; 6% truncated rollouts     │
│ contamination     │ WARN   │   6.0 │ 3 near-matches with openai/gsm8k test   │
└───────────────────┴────────┴───────┴─────────────────────────────────────────┘
overall: WARN   rating: 8.5/10

feedback
The environment is solidly built: it loads cleanly, the reward is a real
verifier (boxed-answer extraction + math equivalence, not a stub), and it
discriminates well: correct completions scored 1.0 and every wrong or
malformed probe scored 0.0, matching my own judgment on 18 of 20 cases.

The main thing to improve is contamination: 3 of the sampled training
instances near-match the openai/gsm8k test split you asked me to check, so
benchmark gains may partly be memorization; either dedupe against that test
split or report on a different set. Second, the parser only accepts \boxed{}
answers; consider accepting plain final-line answers too, or the policy gets
zero reward for correct-but-unformatted output early in training.

Final score: a weighted average out of 10 over the checks that ran (N/A carries no weight). Latency and contamination weigh 0.5 each; the other four checks weigh 1.0 each.
Feedback: 1 to 3 paragraphs covering what the env does right first, then what to improve, in priority order.
A FAIL on any check fails the audit.
The full report is also saved to rlenv_audit_reports/<account>__<name>/report.md (human-readable) and report.json (machine-readable) in your working directory, so you can commit it, share it, or diff it against a re-audit after fixes.
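As a worked example of the weighting, the sketch below recomputes the rating from the sample scorecard. The function names are illustrative rather than the package's API, `None` is used here to mark an N/A check, and the rule that any WARN makes the overall status WARN is an assumption inferred from the sample output; only the FAIL rule is stated above.

```python
# Weights as described: latency and contamination 0.5, the other four 1.0.
WEIGHTS = {
    "integrity": 1.0,
    "problem_alignment": 1.0,
    "reward_design": 1.0,
    "latency": 0.5,
    "rollout_quality": 1.0,
    "contamination": 0.5,
}


def final_score(scores: dict) -> float:
    """Weighted mean over the checks that ran; a score of None means N/A."""
    ran = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(WEIGHTS[name] for name in ran)
    if total_weight == 0:
        raise ValueError("no checks ran")
    return sum(WEIGHTS[name] * s for name, s in ran.items()) / total_weight


def overall_status(statuses: list) -> str:
    """A FAIL on any check fails the audit; the WARN rule is an assumption."""
    if "FAIL" in statuses:
        return "FAIL"
    if "WARN" in statuses:
        return "WARN"
    return "PASS"


# Scores copied from the sample scorecard.
sample = {
    "integrity": 9.5,
    "problem_alignment": 9.0,
    "reward_design": 8.8,
    "latency": 8.5,
    "rollout_quality": 8.0,
    "contamination": 6.0,
}
print(round(final_score(sample), 1))  # 42.55 / 5.0 = 8.51, shown as 8.5
print(overall_status(["PASS"] * 5 + ["WARN"]))
```

With no model endpoint, latency and rollout_quality would be `None` and the total weight would drop from 5.0 to 3.5.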

The six checks

#	Check	Needs	What it does
1	integrity	-	Does it even run and is it shaped right: dataset loads & is well-formed, reward present & callable, follows verifiers conventions, no missing fields / broken imports.
2	problem-statement alignment	-	Given your problem statement (a required input), judge whether the dataset + reward + prompt actually test that problem.
3	reward design	-	Stress-tests the reward without the policy: the agent writes ~20 synthetic completions (correct / wrong / edge / format perturbations), scores them through the real reward, and checks (a) the reward varies & discriminates sensibly and (b) each reward matches the agent's own judgment of quality.
4	latency	model endpoint	How long rollouts take end to end. Reads the shared cached rollouts.
5	rollout quality	model endpoint	Reads actual rollouts and judges whether the env is set up well in practice: system prompt right, outputs sensible, obvious env-caused failure modes.
6	contamination	HF dataset ids	Compares the env's dataset against the HuggingFace datasets you name (e.g. `openai/gsm8k`) and flags matching / near-matching instances. N/A (carries no weight) if you don't provide any.
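To illustrate the idea behind check 3, here is a toy version of the probe loop. Everything in it is a stand-in: `toy_reward` is a simplified boxed-answer matcher rather than any real environment's reward, and the five probes with their judgments are made up for the example.

```python
import re


def toy_reward(completion: str, answer: str) -> float:
    """Stand-in reward: 1.0 if the last boxed answer equals the reference, else 0.0."""
    found = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if found and found[-1].strip() == answer else 0.0


# (completion, judged_correct): synthetic probes paired with the auditor's own judgment.
probes = [
    ("The total is \\boxed{42}", True),   # correct and well-formed
    ("So we get \\boxed{41}", False),     # wrong answer
    ("", False),                          # empty edge case
    ("The answer is 42", True),           # correct but unformatted
    ("\\boxed{ 42 }", True),              # whitespace perturbation
]

rewards = [toy_reward(text, "42") for text, _ in probes]

# (a) the reward should vary across probes rather than being constant
discriminates = len(set(rewards)) > 1

# (b) each reward should agree with the independent judgment
agreement = sum((r == 1.0) == judged for r, (_, judged) in zip(rewards, probes))

print(f"discriminates={discriminates}, agreement={agreement}/{len(probes)}")
```

The one disagreement comes from the correct-but-unformatted probe, which is the same kind of finding as the `\boxed{}` note in the sample feedback.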

Shared rollouts (checks 4 & 5). Both need a model, so rlenv_audit runs rollouts once through verifiers' own vf-eval engine (8 rollouts over ~20 samples, scored + timed, cached) and both checks read that single cache — the rollouts follow the env's real generation path, so multi-turn / tool envs roll out correctly. No endpoint → 4 & 5 are N/A.
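The latency row in the scorecard reports a mean and a p90. A minimal sketch of that summary follows, assuming the cached rollouts can be reduced to a plain list of per-rollout durations in seconds; the real cache format is not described here, and the durations below are invented.

```python
import statistics


def latency_summary(durations_s: list) -> dict:
    """Mean and 90th percentile of rollout durations, in seconds."""
    if len(durations_s) < 2:
        raise ValueError("need at least two rollouts to estimate a percentile")
    # quantiles(n=10) returns the nine decile cut points; the last one is p90.
    p90 = statistics.quantiles(durations_s, n=10, method="inclusive")[-1]
    return {"mean": statistics.fmean(durations_s), "p90": p90}


# Hypothetical durations for a handful of rollouts.
durations = [1.2, 1.8, 2.0, 2.1, 2.4, 1.9, 4.5, 2.3]
print(latency_summary(durations))
```

Reporting p90 alongside the mean makes a few slow rollouts visible even when the average looks fine.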

Repair (opt-in)

If the audit comes back WARN/FAIL, ask for repairs explicitly, e.g. "rewrite the env based on the feedback". The env-repair skill applies the mechanical fixes (parser too strict, reward crashing on edge inputs, missing system prompt, unreachable termination, …) to a local copy under rlenv_audit_repairs/<account>__<name>/; it never touches the installed package or the Hub. Design-level findings (misaligned dataset, contamination, difficulty) are left as written recommendations. Reward-function edits are flagged loudly, every fix is validated against the repaired copy, and a REPAIRS.md documents what changed and why. Re-auditing the repaired copy and publishing it are left to you.

Layout

skills/                 the six checks + the env-audit orchestrator + env-repair (SKILL.md each)
.claude-plugin/         plugin + marketplace manifests (repo doubles as a Claude Code plugin)
rlenv_audit/
  adapters/verifiers.py EnvHandle, the only code that touches verifiers
  tools.py              inspect / score / rollouts / scorecard
  sandbox.py            Docker isolation for scoring untrusted completions
  _sandbox_runner.py    the shim that runs INSIDE the container and scores
  cli.py                the rlenv-audit / env-audit CLI (+ install-skills)
DESIGN.md               architecture notes (adapter contract, verifiers quirks)
REWARD_DESIGN.md        the design guide the judgment checks cite

Development

pip install -e ".[dev]" && pytest tests/

License

MIT

Project details

These details have not been verified by PyPI

Project links

Repository

Release history

Version	Released
0.5.2 (this version)	Jun 12, 2026
0.5.1	Jun 12, 2026
0.5.0	Jun 12, 2026
0.4.0	Jun 11, 2026
0.3.6	Jun 11, 2026
0.3.5	Jun 11, 2026
0.3.4	Jun 11, 2026
0.3.3	Jun 11, 2026
0.3.2	Jun 11, 2026
0.3.1	Jun 11, 2026
0.3.0	Jun 11, 2026

Download files

Source Distribution

rlenv_audit-0.5.2.tar.gz (49.1 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

rlenv_audit-0.5.2-py3-none-any.whl (46.9 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file rlenv_audit-0.5.2.tar.gz.

File metadata

Download URL: rlenv_audit-0.5.2.tar.gz
Upload date: Jun 12, 2026
Size: 49.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.8 {"installer":{"name":"uv","version":"0.11.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"20.04","id":"focal","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for rlenv_audit-0.5.2.tar.gz
Algorithm	Hash digest
SHA256	`c0c6e40ab5270d60d84032db4123d93691864b4d12b3cf3e29e685f375ad03cd`
MD5	`6e5a5ccfb440e17d9f4b24de8f599d09`
BLAKE2b-256	`3d03390ffa709f91026c297c21f066e8a4880f81b647fd7584f90caae9912a94`

File details

Details for the file rlenv_audit-0.5.2-py3-none-any.whl.

File metadata

Download URL: rlenv_audit-0.5.2-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 46.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.8 {"installer":{"name":"uv","version":"0.11.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"20.04","id":"focal","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for rlenv_audit-0.5.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f128431af8b8b0ee71cb374d42de7c3238a6fdd5e746871237f647edbb9b2ed5`
MD5	`62e87630ed6736aea00491070dc233ad`
BLAKE2b-256	`f05b31e13c32b968a51be3bfb72884a50b8117db8d7e12689bd4213399c46c44`
