env_audit — a skill-based auditing system for Prime Intellect `verifiers` RL environments

env_audit

A skill-based auditing system for RL environments. Point an agent (Claude Code / Codex) at a verifiers environment from the Prime Intellect Hub and it runs six checks and produces a scorecard — before you spend GPU hours training on a broken reward.

RL environments are treated like training data, but they are rarely tested before training starts. A broken reward function doesn't crash; it silently teaches the policy garbage. env_audit catches that.

Why skills, not scripts

The six checks are judgment-heavy, non-deterministic evaluations — "does this reward agree with a competent grader?", "is the system prompt missing something?", "does this dataset overlap a benchmark?". Those are done well by an agent, not a hard-coded script. So each check is a skill file (skills/<check>/SKILL.md) that the agent reads and executes with its own reasoning, leaning on a small layer of deterministic tools (rlenv-audit ...) for the exact parts: loading the env, calling the reward function, running rollouts, rendering the scorecard.

Each check returns a score (0–100), a status, and a written justification.
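As a rough illustration of that result shape, here is a minimal sketch. The class name `CheckResult`, its field names, and the `FAIL` status are assumptions made for this example; the sample scorecard further down only shows PASS, WARN, and N/A, and the actual schema used by the tools may differ.

```python
from dataclasses import dataclass
from typing import Optional

# PASS, WARN and N/A appear in the sample scorecard; FAIL is an assumption.
VALID_STATUSES = {"PASS", "WARN", "FAIL", "N/A"}


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical container for one check's outcome."""

    check: str
    status: str
    score: Optional[int]  # None when the check is N/A
    justification: str

    def __post_init__(self) -> None:
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if self.status == "N/A":
            if self.score is not None:
                raise ValueError("N/A checks carry no score")
        elif self.score is None or not 0 <= self.score <= 100:
            raise ValueError("score must be an integer in 0..100")


result = CheckResult("integrity", "PASS", 95, "loads, reward callable, well-formed")
skipped = CheckResult("latency", "N/A", None, "no endpoint")
```

Validating in `__post_init__` keeps an out-of-range score or a scored N/A row from ever reaching the scorecard renderer.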

The six checks

| # | Check | Needs | What it does |
|---|-------|-------|--------------|
| 1 | integrity | | Does it even run and is it shaped right: dataset loads & is well-formed, reward present & callable, follows verifiers conventions, no missing fields / broken imports. |
| 2 | problem-statement alignment | (a problem statement) | Given what the user says the env is for, judge whether the dataset + reward + prompt actually test that. N/A if no problem statement is provided. |
| 3 | reward design | | Stress-tests the reward without the policy: the agent writes ~20 synthetic completions (correct / wrong / edge / format perturbations), scores them through the real reward, and checks (a) the reward varies & discriminates sensibly and (b) each reward matches the agent's own judgment of quality. |
| 4 | latency | model endpoint | How long rollouts take end to end. Reads the shared cached rollouts. |
| 5 | rollout quality | model endpoint | Reads actual rollouts and judges whether the env is set up well in practice: system prompt right, outputs sensible, obvious env-caused failure modes. |
| 6 | contamination | | Infers the domain, picks the public benchmarks for it, and checks whether dataset instances match/near-match benchmark instances. |
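To make the "match/near-match" idea in check 6 concrete, here is a minimal sketch of one way to flag near-duplicates between a dataset and a benchmark. It is not the skill's actual method (the check is agent-driven); the function names, the use of `difflib.SequenceMatcher`, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide a match."""
    return " ".join(text.lower().split())


def near_matches(dataset, benchmark, threshold=0.9):
    """Return (dataset_index, benchmark_index, similarity) for pairs at or above threshold.

    Pairwise comparison is O(len(dataset) * len(benchmark)); fine for a
    ~20-sample audit, too slow for full corpora.
    """
    bench_norm = [normalize(b) for b in benchmark]
    hits = []
    for i, item in enumerate(dataset):
        a = normalize(item)
        for j, b in enumerate(bench_norm):
            ratio = SequenceMatcher(None, a, b).ratio()
            if ratio >= threshold:
                hits.append((i, j, round(ratio, 3)))
    return hits


dataset = ["What is 12 times 7?", "Name the capital of France."]
benchmark = ["what is 12  times 7?", "How many legs does a spider have?"]
print(near_matches(dataset, benchmark))  # prints [(0, 0, 1.0)]
```

A character-level ratio catches copies with changed casing or spacing; paraphrased items would need something stronger, which is part of why the check is left to an agent's judgment.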

Shared rollouts (checks 4 & 5). Both need a model, so env_audit asks once which endpoint/model to use (or "dummy"), runs the rollouts once (8 rollouts over ~20 samples, each scored and timed, then cached), and both checks read that single cache. Checks 1, 2, 3, and 6 need no endpoint. Without an endpoint, checks 4 and 5 are reported as N/A.
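The run-once, read-twice pattern can be sketched as follows. This is an illustration only: `load_or_run_rollouts`, the cache file name, and the record fields (`reward`, `seconds`, `completion`) are hypothetical and do not describe the real cache format.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def load_or_run_rollouts(cache_path: Path, run_rollouts):
    """Run rollouts only if no cache exists; otherwise read the cached JSON."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    rollouts = run_rollouts()
    cache_path.write_text(json.dumps(rollouts))
    return rollouts


calls = []  # counts how often the (expensive) rollout step actually runs


def fake_run():
    # Stand-in for a real model endpoint; the record fields are hypothetical.
    calls.append(1)
    return [
        {"sample": 0, "reward": 1.0, "seconds": 2.0, "completion": "42"},
        {"sample": 1, "reward": 0.0, "seconds": 4.0, "completion": "?"},
    ]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "rollouts.json"
    for_latency = load_or_run_rollouts(cache, fake_run)  # runs and caches
    for_quality = load_or_run_rollouts(cache, fake_run)  # reads the cache

mean_seconds = mean(r["seconds"] for r in for_latency)
print(len(calls), mean_seconds)  # prints 1 3.0
```

Because both checks read the same records, the latency numbers and the judged outputs always describe the same rollouts.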

Quickstart

# Install the skills (pick one)
uvx --from git+https://github.com/vivekvkashyap/RLEnv_audit.git rlenv-audit install-skills
pip install git+https://github.com/vivekvkashyap/RLEnv_audit.git && rlenv-audit install-skills

Or as a Claude Code plugin, no terminal needed:

/plugin marketplace add vivekvkashyap/RLEnv_audit
/plugin install env-audit@rlenv-audit

Then point your agent (Claude Code / Codex) at an environment:

"Audit the gsm8k environment."   /   "Audit primeintellect/aime2024 — I'm trying to train a competition-math solver — using my vLLM at http://localhost:8000/v1."

That's it — everything else is self-bootstrapping: on the first audit the skill installs the rlenv-audit tools (if missing) and vf-installs the environment itself. The agent runs the six checks and prints the scorecard:

                               env_audit · gsm8k
┃ check             ┃ status ┃ score ┃ justification                           ┃
│ integrity         │ PASS   │    95 │ loads, reward callable, well-formed     │
│ problem_alignment │ N/A    │     — │ no problem statement provided           │
│ reward_design     │ PASS   │    88 │ discriminates; matches judgment 18/20   │
│ latency           │ N/A    │     — │ no endpoint                             │
│ rollout_quality   │ N/A    │     — │ no endpoint                             │
│ contamination     │ WARN   │    60 │ 3 near-matches with GSM8K test          │
overall: WARN   rating: B (81/100)
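The sample numbers above are consistent with a simple aggregation: the mean of the three scored checks is (95 + 88 + 60) / 3 = 81, and the overall status equals the worst per-check status. The sketch below encodes that reading. It is inferred from this one sample, not from the tool's source, so the real rule may differ; the `FAIL` status is an assumption, and the letter rating is left out because its cutoffs are not given here.

```python
from statistics import mean

# Severity ordering is an assumption; FAIL does not appear in the sample.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def summarize(rows):
    """rows: (check, status, score) tuples; N/A rows are ignored."""
    scored = [(status, score) for _, status, score in rows if status != "N/A"]
    if not scored:
        return "N/A", None
    overall_status = max((status for status, _ in scored), key=SEVERITY.__getitem__)
    overall_score = round(mean(score for _, score in scored))
    return overall_status, overall_score


rows = [
    ("integrity", "PASS", 95),
    ("problem_alignment", "N/A", None),
    ("reward_design", "PASS", 88),
    ("latency", "N/A", None),
    ("rollout_quality", "N/A", None),
    ("contamination", "WARN", 60),
]
print(summarize(rows))  # prints ('WARN', 81)
```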

From a checkout (development)

pip install -e .                    # the rlenv-audit / env-audit tools
rlenv-audit install-skills          # copy skills/ into ~/.claude/skills
vf-install primeintellect/gsm8k     # install an environment to audit by hand

Most Hub envs require Python 3.11+. The pinned verifiers==0.1.14 also runs on Python 3.10, which helps on old-CUDA boxes, where you can still install the older example envs. The env must be installed into the same Python environment as rlenv-audit, because verifiers loads environments by importing them.
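Since loading is import-based, a quick way to debug "environment not found" problems is to ask the interpreter that runs rlenv-audit whether it can see the module. This is a minimal sketch; `importable_here` is a hypothetical helper, and the assumption that an installed environment is importable under a top-level name such as `gsm8k` should be checked against your install.

```python
import importlib.util
import sys


def importable_here(module_name: str) -> bool:
    """True if the running interpreter can locate a top-level module by name."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter whose site-packages must contain both rlenv-audit and the env.
print(sys.executable)

# "gsm8k" is an assumed module name for the installed environment.
for name in ("verifiers", "gsm8k"):
    print(name, importable_here(name))
```

If this prints `False` for the environment while `vf-install` reported success, the install most likely went into a different virtualenv than the one running the audit.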

The tools (what the skills call)

rlenv-audit inspect <env> -n 20            # load + introspect -> JSON (reward source, samples, prompt)
rlenv-audit score <env> completions.json   # score agent-written completions through the reward fn
rlenv-audit rollouts <env> --endpoint <url> --model <m> -n 20 -k 8   # run+cache shared rollouts
rlenv-audit rollouts <env> --dummy         # fake rollouts, no endpoint (dry run)
rlenv-audit scorecard results.json         # render the final scorecard

These are deterministic and JSON-in/JSON-out — usable directly, but normally driven by the skills.
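For scripted use outside the skills, the commands above can be driven from Python. The sketch below builds the `rollouts` argument list using only the flags listed above. `run_json` additionally assumes the tool prints its JSON result to stdout, which this document does not confirm, so it is defined but not called here; `my-model` is a placeholder.

```python
import json
import shlex
import subprocess


def rollouts_argv(env, endpoint=None, model=None, n=20, k=8):
    """Build argv for `rlenv-audit rollouts` from the documented flags only."""
    if endpoint is None:
        # Dry run with fake rollouts, no endpoint needed.
        return ["rlenv-audit", "rollouts", env, "--dummy"]
    if model is None:
        raise ValueError("a model name is required when an endpoint is given")
    return [
        "rlenv-audit", "rollouts", env,
        "--endpoint", endpoint, "--model", model,
        "-n", str(n), "-k", str(k),
    ]


def run_json(argv):
    """Run a tool and parse stdout as JSON (assumes the tool writes JSON to stdout)."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


print(shlex.join(rollouts_argv("gsm8k")))
print(shlex.join(rollouts_argv("gsm8k", "http://localhost:8000/v1", "my-model")))
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems with endpoint URLs and model names.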

What good looks like

REWARD_DESIGN.md is the reference the reward-design and rollout-quality checks judge against — determinism, discrimination, baseline floor, partial credit, bounds, anti-hacking, parser contract, contamination.
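Three of those properties (determinism, bounds, discrimination) are mechanical enough to sketch in code. The example below is not taken from REWARD_DESIGN.md: the toy reward, the `audit_reward` helper, and the assumed [0, 1] reward range are all hypothetical, and the real checks rely on the agent's judgment rather than fixed rules.

```python
def exact_match_reward(completion: str, answer: str) -> float:
    """Toy reward (hypothetical): 1.0 if the last whitespace token equals the answer."""
    tokens = completion.strip().split()
    return 1.0 if tokens and tokens[-1] == answer else 0.0


def audit_reward(reward, cases, low=0.0, high=1.0):
    """cases: (completion, answer, is_correct) triples labelled by the auditor."""
    scores = [reward(c, a) for c, a, _ in cases]
    repeat = [reward(c, a) for c, a, _ in cases]
    good = [s for s, (_, _, ok) in zip(scores, cases) if ok]
    bad = [s for s, (_, _, ok) in zip(scores, cases) if not ok]
    return {
        "deterministic": scores == repeat,                   # same input, same score
        "bounded": all(low <= s <= high for s in scores),    # assumed range [0, 1]
        "discriminates": bool(good) and bool(bad) and min(good) > max(bad),
    }


clean_cases = [
    ("The answer is 42", "42", True),
    ("42", "42", True),
    ("The answer is 41", "42", False),
    ("", "42", False),
]
# Format perturbation: a correct answer with trailing punctuation.
perturbed_cases = clean_cases + [("The answer is 42.", "42", True)]

clean_report = audit_reward(exact_match_reward, clean_cases)
perturbed_report = audit_reward(exact_match_reward, perturbed_cases)
print(clean_report["discriminates"], perturbed_report["discriminates"])  # prints True False
```

The perturbed case shows the kind of failure the reward-design check looks for: the toy reward scores a correct answer as 0.0 because of a trailing period, so it stops separating correct from wrong completions.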

Layout

skills/                 the six checks + the env-audit orchestrator (SKILL.md each)
.claude-plugin/         plugin + marketplace manifests (repo doubles as a Claude Code plugin)
rlenv_audit/
  adapters/verifiers.py EnvHandle — the only code that touches verifiers
  tools.py              inspect / score / rollouts / scorecard
  sandbox.py            Docker isolation (for executing risky completions)
  cli.py                the rlenv-audit / env-audit CLI (+ install-skills)
REWARD_DESIGN.md        the design guide the judgment checks cite

Development

pip install -e ".[dev]" && pytest tests/

License

MIT
