CodexOpt: Improve AGENTS.md and Skills for Codex with SkillOpt-style validation

Benchmark and optimize AGENTS.md and SKILL.md for Codex.

CodexOpt

CodexOpt is a lightweight CLI for benchmarking and optimizing Codex instruction assets.

It targets the following files:

  • AGENTS.md
  • .codex/skills/**/SKILL.md
  • .agents/skills/**/SKILL.md

CodexOpt gives teams a repeatable workflow to:

  1. Scan instruction files.
  2. Benchmark quality.
  3. Generate optimized candidates.
  4. Apply only improvements.
  5. Produce a report.

Why CodexOpt

Most teams edit AGENTS.md and SKILL.md manually, but struggle to answer:

  • Did quality actually improve?
  • Did we increase prompt bloat?
  • Did we break skill frontmatter conventions?

CodexOpt turns these edits into measurable runs with artifacts you can inspect and version.

Features

  • Project scan with issue detection for agents and skills.
  • Benchmark scoring with sub-scores and natural-language feedback.
  • Optional evidence inputs from repo task files and issue exports.
  • Default heuristic optimization engine (local and deterministic).
  • Reflective engine for Codex-backed SkillOpt/GEPA-style optimization.
  • SkillOpt-inspired skillopt engine for SKILL.md files with train/validation evidence splits, bounded edits, and validation-gated acceptance.
  • Explicit reporting when a model-backed run falls back to heuristic optimization.
  • Safe apply flow with automatic backups.
  • Markdown reporting from latest runs.
  • Minimal OSS CI (lint, test, build).

Installation

Requirements

  • Python >=3.10
  • uv (recommended) or pip

Recommended: uv (full workflow)

uv sync --extra dev

Run commands through the managed environment:

uv run codexopt --help

uv.lock is committed to keep dependency resolution reproducible across machines and CI.

Alternative: pip

pip install -e ".[dev]"

Quick Start (uv)

# 1) Create config
uv run codexopt init

# 2) Inspect what will be evaluated
uv run codexopt scan

# 3) Get baseline scores
uv run codexopt benchmark

# 4) Optimize AGENTS.md
uv run codexopt optimize agents --file AGENTS.md

# 5) Optimize skills
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"

# 6) Review apply impact without writing
uv run codexopt apply --kind agents --dry-run

# 7) Apply selected improvements
uv run codexopt apply --kind agents

# 8) Generate markdown summary
uv run codexopt report --output codexopt-report.md

For Codex-specific rollout workflows, including codex exec --json validation tasks, see Using CodexOpt with Codex.

How Teams Use CodexOpt

Developers use CodexOpt in the repository that contains their Codex instruction assets:

  • AGENTS.md
  • .codex/skills/**/SKILL.md
  • .agents/skills/**/SKILL.md

Optional evidence can also be added to improve benchmarking and optimization quality:

  • task files (tasks.md, task lists, or JSON fixtures)
  • issue/review exports (issues.md or JSON exports)

Typical workflow:

  1. Run scan and benchmark to measure the current instruction assets.
  2. Run optimize agents and optimize skills to generate improved candidates.
  3. Review the generated diffs and report artifacts under .codexopt/runs/.
  4. Run apply --dry-run first, then apply accepted changes.
  5. Commit the updated instruction files and, if useful, attach the report to a PR.
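
The workflow above can be scripted, for example in a pre-PR check. The following is a minimal sketch that only assembles and runs the commands already documented in this README; the helper names workflow_commands and run_workflow are hypothetical and not part of CodexOpt.

```python
import subprocess


def workflow_commands(runner=("uv", "run", "codexopt"), kind="agents",
                      report="codexopt-report.md"):
    """Build the argv lists for scan, benchmark, optimize, dry-run apply, report."""
    base = list(runner)
    return [
        base + ["scan"],
        base + ["benchmark"],
        base + ["optimize", kind],
        base + ["apply", "--kind", kind, "--dry-run"],
        base + ["report", "--output", report],
    ]


def run_workflow(commands):
    """Run each command in order and stop at the first non-zero exit code."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_workflow(workflow_commands(kind="agents"))
```

The sketch stops at apply --dry-run on purpose, so that writing changes stays a deliberate manual step.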

Example with optional evidence configured in codexopt.yaml:

evidence:
  task_files:
    - tasks.md
  issue_files:
    - issues.md

With that config in place, benchmark and optimize use:

  • static prompt-quality checks
  • repo task alignment
  • recurring issue/review themes

Today, task and issue files influence scoring and feedback. With --engine skillopt, CodexOpt uses task evidence as train/validation splits so skill candidates must improve held-out evidence before they are accepted. JSON task files can also define executable rollout commands; when present, those rollout pass rates become the held-out validation gate.
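
To make the validation gate concrete, here is a small sketch of a train/validation split and an acceptance check. It assumes an ordered split and a simple score difference. The README does not specify how CodexOpt partitions evidence internally, so split_evidence and accept_candidate are hypothetical names, and only the defaults (0.67 and 0.01) come from the configuration section.

```python
def split_evidence(tasks, train_ratio=0.67):
    """Split tasks in order: the first part is for proposing edits, the rest is held out."""
    tasks = list(tasks)
    if len(tasks) < 2:
        # Nothing can be held out with fewer than two tasks.
        return tasks, []
    cut = round(len(tasks) * train_ratio)
    # Keep at least one task on each side of the split.
    cut = min(max(cut, 1), len(tasks) - 1)
    return tasks[:cut], tasks[cut:]


def accept_candidate(baseline_validation, candidate_validation, validation_delta=0.01):
    """Accept only if the held-out score improves by at least validation_delta."""
    return candidate_validation - baseline_validation >= validation_delta
```

A candidate that scores better on the training tasks but not on the held-out tasks would be rejected under this rule, which is the point of the gate.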

Use codexopt.example.yaml as a starting point for committed team config.

Command Reference

Global options

codexopt --config <path-to-codexopt.yaml> <command>

init

Create a default config file.

codexopt init [--path PATH] [--force]

scan

Discover AGENTS and SKILL targets and validate their structure.

codexopt scan

benchmark

Score current files using built-in heuristics.

codexopt benchmark

optimize agents

Optimize AGENTS files.

codexopt optimize agents \
  [--file PATTERN] \
  [--engine heuristic|reflective] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]

optimize skills

Optimize SKILL files.

codexopt optimize skills \
  [--glob PATTERN] \
  [--engine heuristic|skillopt|reflective] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]

improve

One command for Codex users: discover targets, mine starter tasks, run the reflective optimizer, and preview the diff.

codexopt improve                    # offline preview
codexopt improve --live             # Codex-backed reflective preview
codexopt improve --live --apply     # write validated changes with backups

apply

Apply best candidates from the latest optimization run (or a provided run id).

codexopt apply [--kind agents|skills] [--run-id RUN_ID] [--dry-run]
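
The difference between a dry run and a real apply can be sketched as follows. This is an illustration of the backup-then-write pattern, not CodexOpt's implementation; apply_candidate, its return shape, and the timestamp format of the backup folder are assumptions.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def apply_candidate(repo: Path, rel_path: str, new_text: str,
                    backups_root: Path, dry_run: bool = False) -> dict:
    """Write new_text to repo/rel_path, backing up the original first."""
    target = repo / rel_path
    if dry_run:
        # Report what would happen without touching the filesystem.
        return {"path": rel_path, "applied": False, "dry_run": True}
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    backup = backups_root / stamp / rel_path
    backup.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(target, backup)  # keep the original content recoverable
    target.write_text(new_text, encoding="utf-8")
    return {"path": rel_path, "applied": True, "dry_run": False, "backup": str(backup)}
```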

report

Generate a Markdown report from the latest runs recorded in state.json.

codexopt report [--output FILE.md]

Configuration

Default codexopt.yaml:

version: 1
targets:
  agents_files:
    - AGENTS.md
    - "**/AGENTS.md"
    - "**/AGENTS.override.md"
  skills_globs:
    - ".codex/skills/**/SKILL.md"
    - "**/.codex/skills/**/SKILL.md"
    - ".agents/skills/**/SKILL.md"
    - "**/.agents/skills/**/SKILL.md"
  exclude_globs:
    - ".git/**"
    - ".codexopt/**"
    - ".venv/**"
    - "node_modules/**"
    - "reference/**"
output:
  root_dir: ".codexopt"
evidence:
  task_files: []
  issue_files: []
optimization:
  engine: "heuristic"
  min_apply_delta: 0.01
  max_metric_calls: 60
  reflection_model: null
  skillopt_train_ratio: 0.67
  skillopt_edit_budget: 24
  skillopt_validation_delta: 0.01

Config notes:

  • targets.agents_files: glob patterns for AGENTS targets.
  • targets.skills_globs: glob patterns for SKILL.md targets.
  • targets.exclude_globs: paths ignored during scan.
  • output.root_dir: run artifacts and backups location.
  • evidence.task_files: optional markdown/json task lists used for repo-alignment scoring.
  • evidence.issue_files: optional markdown/json issue or review exports used for theme-aware feedback.
  • optimization.engine: default optimization engine (heuristic, reflective, or skillopt for skills).
  • optimization.min_apply_delta: minimum score gain required to apply.
  • optimization.max_metric_calls: legacy GEPA metric budget.
  • optimization.reflection_model: legacy GEPA reflection model; the skillopt engine also reads it as an optional proposer model.
  • optimization.skillopt_train_ratio: task evidence fraction used for skill candidate proposal.
  • optimization.skillopt_edit_budget: maximum line edit operations allowed for SkillOpt candidates.
  • optimization.skillopt_validation_delta: minimum held-out validation gain required for SkillOpt acceptance.
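
As an illustration of how the include and exclude globs interact, the sketch below resolves targets with pathlib and fnmatch. The discover function is hypothetical, and CodexOpt's own matching rules may differ (for example in how nested excluded folders are handled).

```python
from fnmatch import fnmatchcase
from pathlib import Path


def discover(root: Path, include_globs, exclude_globs):
    """Return sorted repo-relative paths matching any include glob and no exclude glob."""
    found = set()
    for pattern in include_globs:
        for path in root.glob(pattern):
            if not path.is_file():
                continue
            rel = path.relative_to(root).as_posix()
            # fnmatchcase treats "*" as matching "/" too, so "node_modules/**"
            # excludes everything below a top-level node_modules folder.
            if any(fnmatchcase(rel, ex) for ex in exclude_globs):
                continue
            found.add(rel)
    return sorted(found)
```

Listing both AGENTS.md and "**/AGENTS.md" is harmless here because results are collected in a set before sorting.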

How Scoring Works

CodexOpt computes a 0.0 to 1.0 score per file.

AGENTS scoring factors include:

  • Too short or too long content penalties.
  • Token-heaviness estimate penalty.
  • Empty file penalty.
  • Contradictory guidance penalties.
  • Missing workflow / verification / output-format guidance penalties.
  • Repo-context and task-alignment signals when evidence files are configured.

SKILL scoring factors include:

  • Missing frontmatter penalties.
  • Missing name / description penalties.
  • Overly long frontmatter fields penalties.
  • Too short or too long content penalties.
  • Weak trigger/workflow/verification guidance penalties.
  • Repo task alignment signals when evidence files are configured.

Each benchmarked file also includes:

  • criterion-level sub-scores
  • natural-language feedback
  • optional evidence summary from configured task/issue files
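
A penalty-based score of this kind can be sketched as below for SKILL.md files. Every weight and threshold here is hypothetical and chosen only for illustration; CodexOpt's actual penalties, criteria, and frontmatter parsing are not documented in this README.

```python
# Hypothetical penalty weights; the real values used by CodexOpt are not documented here.
PENALTIES = {
    "empty": 1.0,
    "missing_frontmatter": 0.3,
    "missing_name": 0.15,
    "missing_description": 0.15,
    "too_short": 0.1,
    "too_long": 0.1,
}


def split_frontmatter(text):
    """Return (fields, body); fields is None when there is no leading --- block."""
    if not text.startswith("---\n"):
        return None, text
    end = text.find("\n---", 4)
    if end == -1:
        return None, text
    fields = {}
    for line in text[4:end].splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, text[end + 4:]


def score_skill(text, min_chars=40, max_chars=8000):
    """Return (score in [0.0, 1.0], list of issue labels) for a SKILL.md text."""
    issues = []
    if not text.strip():
        issues.append("empty")
    else:
        fields, body = split_frontmatter(text)
        if fields is None:
            issues.append("missing_frontmatter")
        else:
            if not fields.get("name"):
                issues.append("missing_name")
            if not fields.get("description"):
                issues.append("missing_description")
        if len(body.strip()) < min_chars:
            issues.append("too_short")
        elif len(body) > max_chars:
            issues.append("too_long")
    score = max(0.0, 1.0 - sum(PENALTIES[i] for i in issues))
    return round(score, 4), issues
```

Returning the issue labels alongside the number mirrors the idea of sub-scores and feedback: the score says how much was lost, the labels say why.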

Optimization Behavior

Heuristic engine

Candidate transforms include:

  • Whitespace normalization.
  • Blank-line compaction.
  • Duplicate adjacent line removal.
  • Skill-specific frontmatter synthesis/trimming.

The best candidate is the one with the largest score delta over the original. If that delta is below min_apply_delta, the original content is kept.
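
The first three transforms and the selection rule can be sketched as follows. This is an illustrative approximation, not CodexOpt's code: the function names are hypothetical, and the score function is passed in by the caller because the real scorer is not reproduced here.

```python
import re
from typing import Callable


def normalize_whitespace(text: str) -> str:
    """Normalize line endings and strip trailing whitespace on each line."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines)


def compact_blank_lines(text: str) -> str:
    """Collapse runs of two or more blank lines into one blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


def drop_adjacent_duplicates(text: str) -> str:
    """Remove a non-blank line that exactly repeats the line before it."""
    out = []
    for line in text.split("\n"):
        if line.strip() and out and out[-1] == line:
            continue
        out.append(line)
    return "\n".join(out)


TRANSFORMS = (normalize_whitespace, compact_blank_lines, drop_adjacent_duplicates)


def optimize(original: str, score: Callable[[str], float],
             min_apply_delta: float = 0.01) -> str:
    """Apply transforms cumulatively, keep the best-scoring candidate if it gains enough."""
    baseline = score(original)
    candidates = []
    current = original
    for transform in TRANSFORMS:
        current = transform(current)
        candidates.append(current)
    best = max(candidates, key=score)
    if score(best) - baseline < min_apply_delta:
        return original
    return best
```

With a toy scorer that prefers shorter text, this reproduces the AGENTS.md cleanup shown in Example A.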

Reflective engine

The maintained SkillOpt/GEPA-inspired path is --engine reflective, or the Codex-user shortcut codexopt improve. It evaluates a candidate document on tasks, captures textual feedback, asks an optimizer model to rewrite the document, and accepts the rewrite only when it improves held-out validation tasks.

Defaults stay offline and use static/verifier signals. To run the full live Codex loop, use:

codexopt improve --live

--live uses codex exec as both optimizer and judge. You can also set reflective.optimizer_model and reflective.judge_model to codex, openai/<model>, or another OpenAI-compatible model.

Legacy GEPA engine

--engine gepa is deprecated. It targeted an older gepa.optimize_anything API and now falls back with a clear warning. Use --engine reflective instead.

SkillOpt engine

For SkillOpt-style skill optimization, set the engine in codexopt.yaml:

optimization:
  engine: "skillopt"
  reflection_model: "openai/gpt-5-mini"  # optional; without it, heuristic proposers are used
  skillopt_train_ratio: 0.67
  skillopt_edit_budget: 24
  skillopt_validation_delta: 0.01

Executable rollout task files can be listed in evidence.task_files:

[
  {
    "name": "skill-verifier",
    "description": "Run a repo-local verifier against the candidate skill.",
    "command": ["python", "scripts/verify_skill.py"],
    "timeout_seconds": 30
  }
]

Codex-backed rollout tasks can use backend: "codex" and codex_prompt:

[
  {
    "name": "codex-skill-task",
    "backend": "codex",
    "description": "Run Codex against the candidate skill.",
    "codex_prompt": "Use the local skill to update CHANGELOG.md for a patch release.",
    "timeout_seconds": 120,
    "expected_final_response_contains": "CHANGELOG.md",
    "expected_file_change": "CHANGELOG.md",
    "expected_file_contains": {
      "path": "CHANGELOG.md",
      "contains": "Patch"
    }
  }
]

CodexOpt evaluates those commands in a temporary copy of the repo with the candidate SKILL.md written in place, then records pass/fail details in optimize.json. For Codex-backed rollouts, CodexOpt also parses codex exec --json events into trajectory metadata: final response, commands, file changes, token usage, and errors.
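
The temporary-copy evaluation described above can be approximated with the standard library. The sketch below is an assumption about the general shape of such a rollout, not CodexOpt's implementation; run_rollout and its result keys are hypothetical, and it covers only command-based tasks, not Codex-backed ones.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_rollout(repo: Path, skill_rel_path: str, candidate_text: str,
                command: list, timeout_seconds: int = 30) -> dict:
    """Run command in a throwaway copy of repo with the candidate SKILL.md in place."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        # Copy the repo without VCS data or earlier run artifacts.
        shutil.copytree(repo, work, ignore=shutil.ignore_patterns(".git", ".codexopt"))
        target = work / skill_rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(candidate_text, encoding="utf-8")
        try:
            proc = subprocess.run(command, cwd=work, capture_output=True,
                                  text=True, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            return {"passed": False, "reason": "timeout"}
        return {
            "passed": proc.returncode == 0,
            "returncode": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
```

Because the command runs in a copy, a failing or destructive verifier cannot modify the working tree that the candidate was generated from.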

For OpenAI-compatible reflective models, set the provider credentials and use reflective.optimizer_model / reflective.judge_model values such as openai/gpt-5-mini:

export OPENAI_API_KEY="your-openai-key"

For Gemini-compatible endpoints, set the credentials expected by your OpenAI-compatible client, or run codexopt improve --live to use codex exec directly.

export GEMINI_API_KEY="your-gemini-key"
export GOOGLE_API_KEY="$GEMINI_API_KEY"

Fallback behavior:

  • If a configured optimizer or judge model is unavailable, CodexOpt records a note and falls back to the weaker heuristic/static path.
  • Fallbacks are recorded in optimization artifacts, CLI summaries, and reports.

Artifacts and State

By default, everything is written under .codexopt/:

  • runs/<run_id>/scan.json
  • runs/<run_id>/benchmark.json
  • runs/<run_id>/optimize.json
  • runs/<run_id>/apply.json
  • backups/<timestamp>/... (created on non-dry-run apply)
  • state.json (tracks latest run ids per command type)

Run ids are timestamped and namespaced by command kind, for example:

  • 20260308T184800123456Z-benchmark
  • 20260308T184812654321Z-optimize-skills
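
Run ids in this shape sort chronologically as plain strings. The sketch below builds and parses ids that match the two examples above; the format string is inferred from those examples, and make_run_id and parse_run_id are hypothetical helpers rather than CodexOpt functions.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

STAMP_FORMAT = "%Y%m%dT%H%M%S%f"  # date, "T", time, six-digit microseconds


def make_run_id(kind: str, now: Optional[datetime] = None) -> str:
    """Build an id such as 20260308T184800123456Z-benchmark (UTC)."""
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime(STAMP_FORMAT)}Z-{kind}"


def parse_run_id(run_id: str) -> Tuple[datetime, str]:
    """Split an id into its UTC timestamp and command kind."""
    stamp, kind = run_id.split("-", 1)
    when = datetime.strptime(stamp, STAMP_FORMAT + "Z").replace(tzinfo=timezone.utc)
    return when, kind
```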

Typical Team Workflow

  1. Commit current AGENTS.md and skills.
  2. Run scan and benchmark to establish baseline.
  3. Run optimize agents and/or optimize skills.
  4. Review optimize.json and diffs.
  5. Run apply --dry-run first, then apply.
  6. Run report and attach report to PR.

Examples

Example A: AGENTS.md cleanup

Before (AGENTS.md):

## Coding Rules
Always run tests before commit.
Always run tests before commit.


Keep changes minimal.

After optimization (heuristic):

## Coding Rules
Always run tests before commit.

Keep changes minimal.

What changed:

  • Removed duplicate adjacent line.
  • Compacted extra blank lines.

Example B: SKILL.md missing frontmatter

Before (.codex/skills/my_skill/SKILL.md):

Use this skill for repository release checks.
Run lint, tests, and changelog validation.

After optimization (heuristic):

---
name: my-skill
description: Repository-specific workflow skill.
---

Use this skill for repository release checks.
Run lint, tests, and changelog validation.

What changed:

  • Added required frontmatter block.
  • Generated normalized name from folder name.
  • Added default description.
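
The frontmatter synthesis in Example B can be sketched like this. The normalization rule (lowercase, non-alphanumeric runs replaced by hyphens) is an assumption inferred from my_skill becoming my-skill, and the function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

DEFAULT_DESCRIPTION = "Repository-specific workflow skill."


def normalized_name(skill_path: str) -> str:
    """Derive a skill name from the folder that contains SKILL.md."""
    folder = PurePosixPath(skill_path).parent.name
    return re.sub(r"[^a-z0-9]+", "-", folder.lower()).strip("-")


def ensure_frontmatter(skill_path: str, text: str) -> str:
    """Prepend a minimal frontmatter block unless the file already starts with one."""
    if text.lstrip().startswith("---"):
        return text
    header = (
        "---\n"
        f"name: {normalized_name(skill_path)}\n"
        f"description: {DEFAULT_DESCRIPTION}\n"
        "---\n\n"
    )
    return header + text.lstrip("\n")
```

Running ensure_frontmatter twice leaves the file unchanged the second time, which keeps repeated optimization runs stable.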

Example C: Reproduce end-to-end on a repo

uv run codexopt init
uv run codexopt scan
uv run codexopt benchmark
uv run codexopt optimize agents --file AGENTS.md
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"
uv run codexopt apply --kind skills --dry-run
uv run codexopt apply --kind skills
uv run codexopt report --output codexopt-report.md

Files to inspect after running:

  • .codexopt/runs/*/scan.json
  • .codexopt/runs/*/benchmark.json
  • .codexopt/runs/*/optimize.json
  • .codexopt/runs/*/apply.json
  • .codexopt/backups/*
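
To locate the newest artifact without opening state.json, one option is to rely on the timestamp prefix of the run folder names. This sketch assumes only the directory layout listed above; the JSON schema of the artifacts is not documented here, so the content is loaded as-is. latest_artifact and load_latest are hypothetical helpers.

```python
import json
from pathlib import Path


def latest_artifact(root: Path, name: str = "optimize.json"):
    """Return the newest runs/<run_id>/<name> under root, or None if there is none."""
    # Run ids start with a UTC timestamp, so sorting by folder name is chronological.
    matches = sorted(root.glob(f"runs/*/{name}"), key=lambda p: p.parent.name)
    return matches[-1] if matches else None


def load_latest(root: Path = Path(".codexopt"), name: str = "optimize.json"):
    """Load the newest artifact as parsed JSON, or return None."""
    path = latest_artifact(root, name)
    if path is None:
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```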

CI

A GitHub Actions workflow is included at .github/workflows/ci.yml and runs:

  • uv lock --check for lockfile consistency.
  • uv sync --extra dev for environment setup.
  • Ruff lint checks.
  • Pytest tests.
  • Package build (uv build).

It does not publish packages.

Development

uv lock
uv sync --extra dev
uv run --no-sync ruff check src tests
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q
uv build

FAQ / Troubleshooting

codexopt apply says "no optimization run found"

Cause:

  • No prior optimization run for the selected kind.
  • state.json does not contain the expected latest run pointer.

Fix:

uv run codexopt optimize agents
uv run codexopt apply --kind agents

Or pass an explicit run:

uv run codexopt apply --kind agents --run-id <run_id>

--engine gepa did not use GEPA

Cause:

  • The legacy GEPA engine targeted an older gepa.optimize_anything API.

Behavior:

  • CodexOpt falls back to heuristic optimization and records the deprecation reason.

Fix:

uv run codexopt optimize agents --engine reflective
uv run codexopt improve --live

apply --dry-run says files would be applied, but nothing changed

Expected behavior:

  • --dry-run reports candidate applications without writing files.

To write changes, run again without --dry-run:

uv run codexopt apply --kind agents

Build fails with network/isolation issues

If your environment blocks dependency resolution in isolated builds, use:

uv build

Pytest fails due to unrelated external plugins

Some environments auto-load global pytest plugins that can break local tests. Run with plugin autoload disabled:

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q

Optimization produced no applied changes

Cause:

  • Best candidate delta is below optimization.min_apply_delta, or
  • File content is already equivalent.

Fix:

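  • If the delta is below the threshold, lower optimization.min_apply_delta in codexopt.yaml, then re-run optimize and apply.
  • If the content is already equivalent, there is nothing to apply and no action is needed.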

License

MIT. See LICENSE.

Author

  • Shashi (shashi@super-agentic.ai)
