
env_audit — a skill-based auditing system for Prime Intellect `verifiers` RL environments


env_audit


env_audit audits verifiers RL environments from the Prime Intellect Hub before you spend GPU hours training on them. RL environments are treated like training data, but nobody tests them first — a broken reward function doesn't crash, it silently teaches the policy garbage. env_audit catches that: point an agent (Claude Code / Codex) at an environment and it runs six judgment-based checks — each a skill file the agent executes, backed by a small deterministic tool layer — and returns a scorecard with a score (0–100), a status, and a written justification per check.

Quickstart

# Install the skills (pick one)
uvx rlenv-audit install-skills
pip install rlenv-audit && rlenv-audit install-skills

Then ask your agent (Claude Code / Codex), giving the environment name, your problem statement, and — if you have one — a model endpoint:

"Audit primeintellect/gsm8k. I'm trying to train a grade-school math solver. Use my vLLM endpoint at http://localhost:8000/v1, model Qwen2.5-7B."

That's the whole interface. Everything else is self-bootstrapping: on the first audit the skill installs the rlenv-audit tools (if missing) and vf-installs the environment itself. The problem statement is required (the agent asks if you don't give one); the endpoint is optional — without it the two rollout checks are reported N/A.

Output

The audit returns a scorecard with one row per check (status, score, and a one-line justification), followed by the overall status, a 0–100 rating with a letter grade, and a short prose summary of the biggest issue and what to fix first:

                               env_audit · gsm8k
┃ check             ┃ status ┃ score ┃ justification                           ┃
│ integrity         │ PASS   │    95 │ loads, reward callable, well-formed     │
│ problem_alignment │ PASS   │    90 │ dataset/reward match the stated goal    │
│ reward_design     │ PASS   │    88 │ discriminates; matches judgment 18/20   │
│ latency           │ N/A    │     — │ no endpoint                             │
│ rollout_quality   │ N/A    │     — │ no endpoint                             │
│ contamination     │ WARN   │    60 │ 3 near-matches with GSM8K test          │
overall: WARN   rating: B (83/100)

A FAIL on any check fails the audit. The rating averages only the checks that ran (N/A excluded).
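
As a rough illustration of the aggregation rule above, here is a minimal sketch in Python. The row shape and the function names `overall_status` and `rating` are hypothetical and are not taken from the rlenv-audit code. The rule that a WARN without any FAIL gives an overall WARN is an assumption inferred from the example scorecard. The letter-grade cutoffs are not stated here, so the sketch leaves them out.

```python
from statistics import mean

# Hypothetical row shape: (check name, status, score or None when N/A).
rows = [
    ("integrity", "PASS", 95),
    ("problem_alignment", "PASS", 90),
    ("reward_design", "PASS", 88),
    ("latency", "N/A", None),
    ("rollout_quality", "N/A", None),
    ("contamination", "WARN", 60),
]


def overall_status(rows):
    """A FAIL on any check fails the audit."""
    statuses = {status for _, status, _ in rows}
    if "FAIL" in statuses:
        return "FAIL"
    # Assumption: a WARN with no FAIL yields WARN, as in the example scorecard.
    if "WARN" in statuses:
        return "WARN"
    return "PASS"


def rating(rows):
    """Average only the checks that ran; N/A rows are excluded."""
    scores = [score for _, status, score in rows if status != "N/A"]
    return round(mean(scores)) if scores else None


print(overall_status(rows), rating(rows))  # WARN 83
```

With the gsm8k rows above, the four checks that ran average to 83.25, which rounds to the 83 shown in the scorecard.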

The six checks

| # | Check | Needs | What it does |
|---|-------|-------|--------------|
| 1 | integrity | | Does it even run and is it shaped right: dataset loads and is well-formed, reward present and callable, follows verifiers conventions, no missing fields or broken imports. |
| 2 | problem-statement alignment | | Given your problem statement (a required input), judge whether the dataset + reward + prompt actually test that problem. |
| 3 | reward design | | Stress-tests the reward without the policy: the agent writes ~20 synthetic completions (correct / wrong / edge / format perturbations), scores them through the real reward, and checks (a) the reward varies and discriminates sensibly and (b) each reward matches the agent's own judgment of quality. |
| 4 | latency | model endpoint | How long rollouts take end to end. Reads the shared cached rollouts. |
| 5 | rollout quality | model endpoint | Reads actual rollouts and judges whether the env is set up well in practice: system prompt right, outputs sensible, obvious env-caused failure modes. |
| 6 | contamination | | Infers the domain, picks the public benchmarks for it, and checks whether dataset instances match/near-match benchmark instances. |
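
The reward design check (row 3) can be pictured with a small Python sketch. Everything in it is illustrative: `toy_reward` stands in for an environment's real reward, the probes stand in for the agent's ~20 synthetic completions, and `stress_test` with its 0.5 threshold is a hypothetical helper, not part of rlenv-audit. In the real check the judgment comes from the agent, not from hard-coded booleans.

```python
from statistics import pstdev


def toy_reward(completion: str, answer: str) -> float:
    """Stand-in reward: 1.0 if the last whitespace-separated token equals the answer."""
    tokens = completion.split()
    return 1.0 if tokens and tokens[-1] == answer else 0.0


# (synthetic completion, whether a reader would judge it correct)
probes = [
    ("The total is 42", True),    # correct
    ("42", True),                 # correct, terse
    ("The total is 41", False),   # wrong
    ("", False),                  # edge case: empty completion
    ("The total is 42.", True),   # format perturbation: trailing period
]


def stress_test(reward, probes, answer, threshold=0.5):
    """Score every probe and compare the reward against the stated judgment."""
    scores = [reward(text, answer) for text, _ in probes]
    agreements = [
        (score >= threshold) == judged_good
        for score, (_, judged_good) in zip(scores, probes)
    ]
    return {
        "varies": pstdev(scores) > 0.0,  # (a) the reward is not constant
        "agree": sum(agreements),        # (b) rewards matching the judgment
        "total": len(probes),
        "disagreements": [
            text for (text, _), ok in zip(probes, agreements) if not ok
        ],
    }


report = stress_test(toy_reward, probes, answer="42")
print(f"matches judgment {report['agree']}/{report['total']}")
print(report["disagreements"])
```

Here the toy reward disagrees with the judgment on the trailing-period probe, which is the kind of format brittleness this check is meant to surface.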

Shared rollouts (checks 4 & 5). Both need a model, so env_audit runs rollouts once (8 rollouts over ~20 samples, scored + timed, cached) and both checks read that single cache. No endpoint → 4 & 5 are N/A.
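
The run-once, read-twice pattern described above might look roughly like the following Python sketch. The names `collect_rollouts`, `latency_check`, and `rollout_quality_inputs`, the JSON cache file, and the `generate` / `score` callables are all assumptions made for illustration; the actual cache format and tool signatures in rlenv-audit may differ.

```python
import json
import tempfile
import time
from pathlib import Path
from statistics import mean


def collect_rollouts(cache_path, prompts, generate, score):
    """Run rollouts once, cache them, and reuse the cache on later calls.

    Returns None when no model is available (generate is None), in which
    case the two rollout checks would be reported as N/A.
    """
    if generate is None:
        return None
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    records = []
    for prompt in prompts:
        start = time.perf_counter()
        completion = generate(prompt)
        elapsed = time.perf_counter() - start
        records.append({
            "prompt": prompt,
            "completion": completion,
            "reward": score(prompt, completion),
            "seconds": elapsed,
        })
    cache_path.write_text(json.dumps(records))
    return records


def latency_check(records):
    """Check 4 reads timings from the shared records."""
    return mean(r["seconds"] for r in records)


def rollout_quality_inputs(records):
    """Check 5 reads the same records for prompts, outputs, and rewards."""
    return [(r["prompt"], r["completion"], r["reward"]) for r in records]


# A fake model that counts its calls, so the caching is observable.
calls = []


def fake_generate(prompt):
    calls.append(prompt)
    return prompt.upper()


cache = Path(tempfile.mkdtemp()) / "rollouts.json"
prompts = ["a", "b", "c"]
first = collect_rollouts(cache, prompts, fake_generate, lambda p, c: 1.0)
second = collect_rollouts(cache, prompts, fake_generate, lambda p, c: 1.0)
print(len(calls))  # 3: the second call was served from the cache
```

The design choice is that model calls are the expensive part, so both checks consume one cached set of scored and timed rollouts instead of each sampling its own.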

Layout

skills/                 the six checks + the env-audit orchestrator (SKILL.md each)
.claude-plugin/         plugin + marketplace manifests (repo doubles as a Claude Code plugin)
rlenv_audit/
  adapters/verifiers.py EnvHandle — the only code that touches verifiers
  tools.py              inspect / score / rollouts / scorecard
  sandbox.py            Docker isolation (for executing risky completions)
  cli.py                the rlenv-audit / env-audit CLI (+ install-skills)
REWARD_DESIGN.md        the design guide the judgment checks cite

Development

pip install -e ".[dev]" && pytest tests/

License

MIT
