env_audit — a skill-based auditing system for Prime Intellect `verifiers` RL environments

env_audit

env_audit audits `verifiers` RL environments from the Prime Intellect Hub before you spend GPU hours training on them. RL environments are treated like training data, but nobody tests them first: a broken reward function doesn't crash, it silently teaches the policy garbage. env_audit catches that. Point an agent (Claude Code / Codex) at an environment and it runs six judgment-based checks, each a skill file the agent executes, backed by a small deterministic tool layer. It returns a scorecard with a score out of 10, a status, and a written justification per check, plus overall feedback on what the env does right and what to improve.

Quickstart

# Install the skills (pick one)
uvx rlenv-audit install-skills
pip install rlenv-audit && rlenv-audit install-skills

Then ask your agent (Claude Code / Codex), giving the full environment id (account/name — bare names like gsm8k are ambiguous on the Hub), your problem statement, and optionally a model endpoint and the HuggingFace datasets to check contamination against:

Audit primeintellect/gsm8k. I'm trying to train a grade-school math solver.
Use my vLLM endpoint at http://localhost:8000/v1, model Qwen2.5-7B.
Check contamination against openai/gsm8k.

Output

The scorecard — one row per check, each scored out of 10 — plus one final score and written feedback:

                               env_audit · gsm8k
┃ check             ┃ status ┃ score ┃ justification                           ┃
│ integrity         │ PASS   │   9.5 │ loads, reward callable, well-formed     │
│ problem_alignment │ PASS   │   9.0 │ dataset/reward match the stated goal    │
│ reward_design     │ PASS   │   8.8 │ discriminates; matches judgment 18/20   │
│ latency           │ N/A    │     — │ no endpoint                             │
│ rollout_quality   │ N/A    │     — │ no endpoint                             │
│ contamination     │ WARN   │   6.0 │ 3 near-matches with openai/gsm8k test   │
overall: WARN   rating: 8.7/10

feedback
The environment is solidly built: it loads cleanly, the reward is a real
verifier (boxed-answer extraction + math equivalence, not a stub), and it
discriminates well — correct completions scored 1.0 and every wrong or
malformed probe scored 0.0, matching my own judgment on 18 of 20 cases.

The main thing to improve is contamination: 3 of the sampled training
instances near-match the openai/gsm8k test split you asked me to check, so
benchmark gains may partly be memorization — either dedupe against that test
split or report on a different set. Second, the parser only accepts \boxed{}
answers; consider accepting plain final-line answers too, or the policy gets
zero reward for correct-but-unformatted output early in training.

  • Final score — a weighted average out of 10 over the checks that ran (N/A carries no weight). Latency and contamination weigh 0.5 each, the other four checks 1.0.
  • Feedback — 1–3 paragraphs: what the env does right first, then what to improve, in priority order.
  • A FAIL on any check fails the audit.
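
The final score in the sample scorecard can be reproduced by hand. The sketch below is illustrative only and is not the package's own code: the names `WEIGHTS`, `final_score`, and `overall_status` are hypothetical, the weights are the ones stated above, and the rule that a WARN on any check makes the overall status WARN is an assumption inferred from the sample output (only the FAIL rule is stated).

```python
from typing import Optional

# Weights as stated above: latency and contamination count 0.5, the rest 1.0.
WEIGHTS = {
    "integrity": 1.0,
    "problem_alignment": 1.0,
    "reward_design": 1.0,
    "latency": 0.5,
    "rollout_quality": 1.0,
    "contamination": 0.5,
}


def final_score(scores: dict[str, Optional[float]]) -> float:
    """Weighted average over the checks that ran; None marks an N/A check."""
    ran = {name: score for name, score in scores.items() if score is not None}
    total_weight = sum(WEIGHTS[name] for name in ran)
    if total_weight == 0:
        raise ValueError("no checks ran, so there is nothing to average")
    return sum(WEIGHTS[name] * score for name, score in ran.items()) / total_weight


def overall_status(statuses: dict[str, str]) -> str:
    """Any FAIL fails the audit (stated above); WARN over PASS is an assumption."""
    values = set(statuses.values())
    if "FAIL" in values:
        return "FAIL"
    if "WARN" in values:
        return "WARN"
    return "PASS"


# Values copied from the sample gsm8k scorecard; latency and rollout_quality are N/A.
scores = {
    "integrity": 9.5,
    "problem_alignment": 9.0,
    "reward_design": 8.8,
    "latency": None,
    "rollout_quality": None,
    "contamination": 6.0,
}
statuses = {
    "integrity": "PASS",
    "problem_alignment": "PASS",
    "reward_design": "PASS",
    "latency": "N/A",
    "rollout_quality": "N/A",
    "contamination": "WARN",
}

rating = round(final_score(scores), 1)
print(f"overall: {overall_status(statuses)}   rating: {rating}/10")
```

With the sample values the weighted sum is 9.5 + 9.0 + 8.8 + 0.5 × 6.0 = 30.3 over a total weight of 3.5, which rounds to the 8.7 shown in the scorecard.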

The six checks

| # | Check | Needs | What it does |
|---|-------|-------|--------------|
| 1 | integrity | | Does it even run and is it shaped right: dataset loads & is well-formed, reward present & callable, follows verifiers conventions, no missing fields / broken imports. |
| 2 | problem-statement alignment | | Given your problem statement (a required input), judge whether the dataset + reward + prompt actually test that problem. |
| 3 | reward design | | Stress-tests the reward without the policy: the agent writes ~20 synthetic completions (correct / wrong / edge / format perturbations), scores them through the real reward, and checks (a) the reward varies & discriminates sensibly and (b) each reward matches the agent's own judgment of quality. |
| 4 | latency | model endpoint | How long rollouts take end to end. Reads the shared cached rollouts. |
| 5 | rollout quality | model endpoint | Reads actual rollouts and judges whether the env is set up well in practice: system prompt right, outputs sensible, obvious env-caused failure modes. |
| 6 | contamination | HF dataset ids | Compares the env's dataset against the HuggingFace datasets you name (e.g. openai/gsm8k) and flags matching / near-matching instances. N/A (and carries no weight) if you don't provide any. |
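
To make check 3 concrete, here is a minimal sketch of the probe idea, under loud assumptions: `toy_reward` is a hypothetical stand-in for an environment's reward (a bare `\boxed{}` exact match), `audit_reward` and the probe list are invented for illustration, and in a real audit the agent writes the probes and supplies the judgments. None of this is env_audit's actual tool layer.

```python
import re
from statistics import pvariance


def toy_reward(completion: str, answer: str) -> float:
    """Hypothetical reward: 1.0 if the last \\boxed{...} equals the answer, else 0.0."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if boxed and boxed[-1].strip() == answer else 0.0


# Each probe pairs a synthetic completion with the auditor's own judgment
# of the reward it deserves (the reference answer here is "42").
PROBES = [
    (r"So the total is \boxed{42}", 1.0),  # correct
    (r"So the total is \boxed{41}", 0.0),  # wrong
    ("", 0.0),                             # edge case: empty completion
    ("The answer is 42", 1.0),             # correct but unformatted
    (r"\boxed{ 42 }", 1.0),                # format perturbation: extra whitespace
]


def audit_reward(reward_fn, probes, answer):
    """Score every probe through the reward and compare with the judgments."""
    rewards = [reward_fn(text, answer) for text, _ in probes]
    disagreements = [
        text
        for (text, expected), got in zip(probes, rewards)
        if abs(got - expected) > 0.5
    ]
    return {
        # (a) a reward that never varies cannot discriminate anything
        "varies": pvariance(rewards) > 0,
        # (b) how often the reward matches the auditor's judgment
        "agreement": (len(probes) - len(disagreements), len(probes)),
        "disagreements": disagreements,
    }


report = audit_reward(toy_reward, PROBES, answer="42")
print(report)
```

Here the only disagreement is the correct-but-unformatted probe, the same kind of finding as the `\boxed{}` remark in the sample feedback.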

Shared rollouts (checks 4 & 5). Both need a model, so env_audit runs rollouts once (8 rollouts over ~20 samples, scored + timed, cached) and both checks read that single cache. No endpoint → 4 & 5 are N/A.
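
The run-once, read-twice pattern described above can be sketched as follows. This is an illustration under assumptions, not the package's implementation: `cached_rollouts`, `latency_check`, `quality_check`, the JSON cache file, and `fake_generate` (standing in for a call to your model endpoint) are all hypothetical, and reward scoring is left out for brevity.

```python
import json
import tempfile
import time
from pathlib import Path
from statistics import mean
from typing import Callable


def cached_rollouts(
    cache_path: Path, prompts: list[str], generate: Callable[[str], str]
) -> list[dict]:
    """Run timed rollouts once; later calls read the cache instead of the model."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    records = []
    for prompt in prompts:
        start = time.perf_counter()
        completion = generate(prompt)
        records.append(
            {
                "prompt": prompt,
                "completion": completion,
                "seconds": time.perf_counter() - start,
            }
        )
    cache_path.write_text(json.dumps(records))
    return records


def latency_check(records: list[dict]) -> float:
    """Mean end-to-end time per rollout, in seconds."""
    return mean(r["seconds"] for r in records)


def quality_check(records: list[dict]) -> list[dict]:
    """Flag one obvious failure mode: rollouts with an empty completion."""
    return [r for r in records if not r["completion"].strip()]


calls = []


def fake_generate(prompt: str) -> str:
    """Hypothetical model call; records each invocation so we can count them."""
    calls.append(prompt)
    return f"echo: {prompt}"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "rollouts.json"
    prompts = ["2 + 2 = ?", "3 * 3 = ?"]
    first = cached_rollouts(cache, prompts, fake_generate)   # hits the model
    second = cached_rollouts(cache, prompts, fake_generate)  # reads the cache
    mean_seconds = latency_check(second)
    empty = quality_check(second)

print(f"model calls: {len(calls)}, mean latency: {mean_seconds:.6f}s, empty: {len(empty)}")
```

Both checks read the same records, so the model is called once per prompt no matter how many checks consume the cache.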

Layout

skills/                 the six checks + the env-audit orchestrator (SKILL.md each)
.claude-plugin/         plugin + marketplace manifests (repo doubles as a Claude Code plugin)
rlenv_audit/
  adapters/verifiers.py EnvHandle — the only code that touches verifiers
  tools.py              inspect / score / rollouts / scorecard
  sandbox.py            Docker isolation (for executing risky completions)
  cli.py                the rlenv-audit / env-audit CLI (+ install-skills)
REWARD_DESIGN.md        the design guide the judgment checks cite

Development

pip install -e ".[dev]" && pytest tests/

License

MIT
