env_audit — a skill-based auditing system for Prime Intellect `verifiers` RL environments
env_audit
env_audit audits `verifiers` RL environments from the Prime Intellect Hub before you spend GPU hours
training on them. RL environments are treated like training data, but nobody
tests them first — a broken reward function doesn't crash, it silently teaches
the policy garbage. env_audit catches that: point an agent (Claude Code / Codex)
at an environment and it runs six judgment-based checks — each a skill file
the agent executes, backed by a small deterministic tool layer — and returns a
scorecard with a score (0–100), a status, and a written justification per check.
Quickstart
# Install the skills (pick one)
uvx rlenv-audit install-skills
pip install rlenv-audit && rlenv-audit install-skills
Then ask your agent (Claude Code / Codex), giving the environment name, your problem statement, and, if you have one, a model endpoint:
"Audit primeintellect/gsm8k. I'm trying to train a grade-school math solver. Use my vLLM endpoint at http://localhost:8000/v1, model Qwen2.5-7B."
That's the whole interface. Everything else is self-bootstrapping: on the first
audit the skill installs the rlenv-audit tools (if missing) and vf-installs
the environment itself. The problem statement is required (the agent asks if
you don't give one); the endpoint is optional — without it the two rollout
checks are reported N/A.
Output
The scorecard has one row per check with its status, score, and a one-line justification. It ends with the overall result (a status plus a 0–100 rating with a letter grade) and a short prose summary of the biggest issue and what to fix first:
env_audit · gsm8k
┃ check ┃ status ┃ score ┃ justification ┃
│ integrity │ PASS │ 95 │ loads, reward callable, well-formed │
│ problem_alignment │ PASS │ 90 │ dataset/reward match the stated goal │
│ reward_design │ PASS │ 88 │ discriminates; matches judgment 18/20 │
│ latency │ N/A │ — │ no endpoint │
│ rollout_quality │ N/A │ — │ no endpoint │
│ contamination │ WARN │ 60 │ 3 near-matches with GSM8K test │
overall: WARN rating: B (83/100)
A FAIL on any check fails the audit. The rating averages only the checks that
ran (N/A excluded).
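The aggregation rule above can be sketched in a few lines. This is only an illustration built from the example scorecard, not the package's actual code: the function names are hypothetical, rounding to the nearest integer is an assumption, and so is the rule that a WARN without any FAIL gives an overall WARN (it is inferred from the example output). The letter-grade thresholds are not stated here, so the sketch leaves them out.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, str, Optional[int]]  # (check, status, score or None when N/A)

# Rows copied from the example scorecard above.
SCORECARD: List[Row] = [
    ("integrity", "PASS", 95),
    ("problem_alignment", "PASS", 90),
    ("reward_design", "PASS", 88),
    ("latency", "N/A", None),
    ("rollout_quality", "N/A", None),
    ("contamination", "WARN", 60),
]


def overall_status(rows: List[Row]) -> str:
    """A FAIL on any check fails the audit."""
    statuses = {status for _, status, _ in rows}
    if "FAIL" in statuses:
        return "FAIL"
    # Assumption: any WARN (and no FAIL) yields WARN, as in the example.
    if "WARN" in statuses:
        return "WARN"
    return "PASS"


def rating(rows: List[Row]) -> Optional[int]:
    """Average only the checks that ran; N/A rows are excluded."""
    scores = [score for _, status, score in rows if status != "N/A" and score is not None]
    if not scores:
        return None
    # Assumption: the displayed rating is rounded to the nearest integer.
    return round(sum(scores) / len(scores))


print(overall_status(SCORECARD), rating(SCORECARD))  # WARN 83
```

With the four checks that ran, (95 + 90 + 88 + 60) / 4 = 83.25, which matches the 83/100 shown in the example.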
The six checks
| # | Check | Needs | What it does |
|---|---|---|---|
| 1 | integrity | — | Does it even run and is it shaped right: dataset loads & is well-formed, reward present & callable, follows verifiers conventions, no missing fields / broken imports. |
| 2 | problem-statement alignment | — | Given your problem statement (a required input), judge whether the dataset + reward + prompt actually test that problem. |
| 3 | reward design | — | Stress-tests the reward without the policy: the agent writes ~20 synthetic completions (correct / wrong / edge / format perturbations), scores them through the real reward, and checks (a) the reward varies & discriminates sensibly and (b) each reward matches the agent's own judgment of quality. |
| 4 | latency | model endpoint | How long rollouts take end to end. Reads the shared cached rollouts. |
| 5 | rollout quality | model endpoint | Reads actual rollouts and judges whether the env is set up well in practice — system prompt right, outputs sensible, obvious env-caused failure modes. |
| 6 | contamination | — | Infers the domain, picks the public benchmarks for it, and checks whether dataset instances match/near-match benchmark instances. |
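To make check 3 concrete, here is a minimal sketch of the idea: score a handful of hand-written completions through a reward function and compare the result with a judgment of each one. Everything in it is hypothetical. `toy_reward`, `CASES`, and `stress_test` are illustrative names, the reward is a deliberately brittle stand-in rather than any real environment's reward, and the hard-coded `judged_correct` flags stand in for the agent's own judgment.

```python
from typing import Callable, List, Tuple


def toy_reward(completion: str, answer: str) -> float:
    """Hypothetical reward: 1.0 if the last whitespace-separated token equals the answer."""
    tokens = completion.strip().split()
    return 1.0 if tokens and tokens[-1] == answer else 0.0


# (completion, judged_correct) pairs covering correct / wrong / edge / format cases.
CASES: List[Tuple[str, bool]] = [
    ("The answer is 42", True),       # correct
    ("42", True),                     # correct, terse
    ("The answer is 41", False),      # wrong
    ("", False),                      # edge: empty completion
    ("The answer is 42.", True),      # format perturbation: trailing period
    ("42 is wrong, it is 7", False),  # mentions the answer but concludes otherwise
]


def stress_test(
    reward: Callable[[str, str], float],
    cases: List[Tuple[str, bool]],
    answer: str,
) -> Tuple[bool, int, int]:
    scores = [reward(text, answer) for text, _ in cases]
    # (a) does the reward vary at all across the synthetic completions?
    varies = len(set(scores)) > 1
    # (b) how often does the reward agree with the judgment of quality?
    agree = sum((score >= 0.5) == judged for score, (_, judged) in zip(scores, cases))
    return varies, agree, len(cases)


varies, agree, total = stress_test(toy_reward, CASES, "42")
print(f"varies={varies}, matches judgment {agree}/{total}")  # varies=True, matches judgment 5/6
```

The one disagreement is the trailing-period case: the toy reward scores a correct answer as 0.0 because of formatting, which is the kind of silent failure this check is meant to surface.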
Shared rollouts (checks 4 & 5). Both need a model, so env_audit runs rollouts once (8 rollouts over ~20 samples, scored + timed, cached) and both checks read that single cache. No endpoint → 4 & 5 are N/A.
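The run-once, read-twice pattern described above might look roughly like the sketch below. It is not the package's implementation: `fake_model`, `get_rollouts`, and the two check functions are hypothetical names, the model call is a local stand-in for an endpoint request, the cache lives in memory only, and reward scoring is omitted for brevity.

```python
import time
from functools import lru_cache
from typing import List, Tuple

CALLS = {"n": 0}  # counts model calls so the caching effect is visible


def fake_model(prompt: str) -> str:
    """Stand-in for a request to the model endpoint (assumption: no network here)."""
    CALLS["n"] += 1
    return prompt.upper()


@lru_cache(maxsize=None)
def get_rollouts(prompts: Tuple[str, ...]) -> Tuple[Tuple[str, str, float], ...]:
    """Run every prompt once, timing each call; repeated calls hit the cache."""
    records = []
    for prompt in prompts:
        start = time.perf_counter()
        completion = fake_model(prompt)
        elapsed = time.perf_counter() - start
        records.append((prompt, completion, elapsed))
    return tuple(records)


def latency_check(prompts: Tuple[str, ...]) -> float:
    """Mean seconds per rollout, read from the shared cache."""
    rollouts = get_rollouts(prompts)
    return sum(elapsed for _, _, elapsed in rollouts) / len(rollouts)


def rollout_quality_check(prompts: Tuple[str, ...]) -> List[str]:
    """Return the completions a judge would read, from the same cache."""
    return [completion for _, completion, _ in get_rollouts(prompts)]


PROMPTS = ("what is 2+2?", "what is 3*3?")
latency_check(PROMPTS)
rollout_quality_check(PROMPTS)
print(CALLS["n"])  # 2: one model call per prompt, even though two checks ran
```

Keeping the rollouts behind a single cached function means the second check adds no extra endpoint traffic, and both checks are guaranteed to look at the same completions.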
Layout
skills/                    the six checks + the env-audit orchestrator (SKILL.md each)
.claude-plugin/            plugin + marketplace manifests (repo doubles as a Claude Code plugin)
rlenv_audit/
  adapters/verifiers.py    EnvHandle: the only code that touches verifiers
  tools.py                 inspect / score / rollouts / scorecard
  sandbox.py               Docker isolation (for executing risky completions)
  cli.py                   the rlenv-audit / env-audit CLI (+ install-skills)
REWARD_DESIGN.md           the design guide the judgment checks cite
Development
pip install -e ".[dev]" && pytest tests/
License
MIT