Optimize a software metric with Codex, git worktrees, and Docker

codex-optimize

https://github.com/user-attachments/assets/7646dab7-d12a-4574-a493-9d130e9042e9

Optimize any software with the Codex SDK.

codopt clones your repository into a run directory, fans out candidate branches with git worktrees, runs one Codex agent per branch in its own Docker container, and evaluates each branch with a benchmark command plus a correctness test command. Surviving branches fork again in later rounds.

By default, codopt snapshots your current working tree into a disposable internal repo first, so local tracked edits are part of the optimization baseline even if they are not committed yet.

Why?

One approach to AI-assisted software optimization is to point an agent at some code and tell it to optimize it. This has several problems:

  1. Agents tend to cheat benchmarks, even unintentionally. When you tell an agent to maximize a value without constraints, it will often hack the benchmarks and tests to produce a result that looks great but, on closer inspection, is not a substantive optimization.
  2. Agents are nondeterministic, so an agent can fail at the optimization one time and succeed the next, even with the same prompt.
  3. Agents can get lazy! This is unintuitive, but an agent prompted to "optimize" often concludes it is done as soon as it thinks it has provided an answer. Once the statement that it is done is in its context, it keeps believing it. In a sense, it has poisoned its own context.

codex-optimize attempts to solve these problems:

  1. codopt explicitly partitions the source code, optimization tests, and correctness tests. Since these parts are partitioned and tracked in git, they can be reset to evaluate whether the source code changes were substantive, which prevents the benchmark-hacking behavior.
  2. By running a beam search strategy, we can see a diverse variety of attempts and keep exploring the ones that work. The example run below shows this well: some of the Codex agents actually degraded the quality of the optimization, but the top candidates significantly optimized the code.
  3. By pruning nodes that are failing or stagnating, we can avoid context poisoning and get results over more iterations. This is also demonstrated in the example below, where after some iterations some nodes fail while others keep improving.
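The fork-and-prune loop described above can be sketched in a few lines of Python. This is a toy illustration, not codopt's implementation: `Node`, `beam_search`, `toy_propose`, and the `max_survivors` parameter are hypothetical names, and `propose` stands in for "run one Codex agent on a branch, then run the benchmark and the tests". The pruning rule used here (drop children that fail tests or do not beat their parent) is an assumption about what "failing or stagnating" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One candidate branch: a lineage name, its metric, and its test result."""
    name: str
    score: float
    tests_pass: bool = True


def beam_search(root, propose, rounds, branch, max_survivors):
    """Fork each survivor `branch` times per round, then prune and rank.

    `propose(parent, index)` is a hypothetical callback that returns the
    evaluated child Node for one agent run.
    """
    survivors = [root]
    for _ in range(rounds):
        kept = []
        for parent in survivors:
            for index in range(branch):
                child = propose(parent, index)
                # Prune failing nodes (tests broken) and stagnating nodes
                # (no improvement over the parent they forked from).
                if child.tests_pass and child.score > parent.score:
                    kept.append(child)
        if not kept:
            break  # every branch failed or stalled; keep the last survivors
        kept.sort(key=lambda node: node.score, reverse=True)
        survivors = kept[:max_survivors]
    return survivors


def toy_propose(parent, index):
    """Deterministic stand-in for an agent run, for illustration only."""
    if index == 0:  # this child makes the metric worse
        return Node(f"{parent.name}.0", parent.score - 1)
    if index == 1:  # this child "wins" the benchmark but breaks the tests
        return Node(f"{parent.name}.1", parent.score + 5, tests_pass=False)
    return Node(f"{parent.name}.{index}", parent.score + index)


if __name__ == "__main__":
    best = beam_search(Node("root", 10.0), toy_propose,
                       rounds=2, branch=3, max_survivors=2)
    print(best)
```

Note that the child with the highest raw score is discarded because its tests fail, which is the point of keeping the correctness check separate from the benchmark.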

The core idea is to use the Codex SDK to optimize more deterministically than using Skills or prompting.

Quick Start

example/life contains a Conway's Game of Life challenge chosen to be optimizable but not one-shottable.

Install the CLI locally for testing:

uv tool install /path/to/codex-optimize

View the result of my run in the UI:

codopt ui --run-root example/life_result/run

Alternatively you can run it yourself.

Run:

codopt run \
  --edit example/life/life.py --metric example/life/metric.json --metric-key score --command "python3 example/life/benchmark.py" \
  --branch 3 --time 120 --info example/life/INFO.md --max-agents 6 --test "python3 example/life/tests.py" --docker-image codopt-life:latest --rounds 2

Read more about this run in the result's README.MD.

Another option is to ask your agent to use it! For that, there is an optimize skill folder you can copy into ~/.codex/skills/optimize; restart Codex afterward.

Here is a demo video of Codex using the codopt skill to achieve a 33% improvement in tokens per second for LLM inference.

https://github.com/user-attachments/assets/f34ac402-c19c-4ced-9215-5ff9f2a0e889

Read more about that here or view the repo codopt created here.

CLI Flags

  • --edit: repeatable file or directory the agent may edit
  • --metric: metric file written by the benchmark command
  • --metric-key: JSON key to read when the metric file is JSON, default score
  • --lower-is-better: invert the parsed metric value for ranking
  • --command: benchmark command
  • --command-file: path to a shell snippet file executed with sh -eu; repo-local files run from the cloned repo path, external files are copied into the run root
  • --branch: children per surviving node
  • --time: per-node Codex time budget in seconds
  • --info / --info-file: background context file given to the agent, may be outside the repo
  • --info-text: inline background context for the agent
  • --max-agents: active-node cap used to decide survivor count
  • --test: correctness test command
  • --test-file: path to a shell snippet file executed with sh -eu; repo-local files run from the cloned repo path, external files are copied into the run root
  • --docker-image: optional prebuilt container image for agent and evaluation runs
  • --dockerfile: optional Dockerfile to build and use for agent and evaluation runs
  • --source-mode: working-tree (default) snapshots the current repo state; head uses Git HEAD only
  • --rounds: tournament depth
  • --allow-path: repeatable extra writable path
  • --keep-worktrees: keep worktree directories after completion

Metric Key

Your benchmark command does not need to match the Life example, but it does need to produce one metric file that codopt can parse:

  • if the metric file is plain text, it must contain a single numeric value
  • if the metric file is JSON, codopt reads one numeric field from it
  • by default that JSON field is score unless --metric-key is passed
  • by default higher values are treated as better unless --lower-is-better is passed
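The rules above can be sketched as a small parser. This is an illustration under stated assumptions, not codopt's own code: `parse_metric` is a hypothetical function, it takes the file contents as a string, and it models "lower is better" by negating the value so that larger always ranks higher.

```python
import json
import math


def parse_metric(text, key="score", lower_is_better=False):
    """Return a ranking value where a larger number always means better."""
    stripped = text.strip()
    try:
        value = float(stripped)  # plain-text file: a single numeric value
    except ValueError:
        data = json.loads(stripped)  # otherwise, treat the file as JSON
        if not isinstance(data, dict) or key not in data:
            raise ValueError(f"metric file has no field {key!r}")
        value = data[key]
    # Reject non-numeric fields; bool is excluded because it subclasses int.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"metric {key!r} is not numeric: {value!r}")
    value = float(value)
    if not math.isfinite(value):
        raise ValueError("metric must be a finite number")
    # Assumption: inversion is modeled as negation for ranking purposes.
    return -value if lower_is_better else value


if __name__ == "__main__":
    print(parse_metric("42\n"))
    print(parse_metric('{"score": 1.5, "runs": 3}'))
    print(parse_metric('{"ms": 12}', key="ms", lower_is_better=True))
```

Malformed input raises `ValueError` in every case, because `json.JSONDecodeError` is a subclass of `ValueError`.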

Requirements

Before running codopt, you need:

  • git
  • docker
  • uv
  • Python 3 on the host
  • an existing Codex login on the host in ~/.codex

Important setup notes:

  • run codopt from the root of the Git repo you want to optimize
  • Docker must be running
  • codopt seeds a run-local CODEX_HOME from your host ~/.codex, so you need to already be authenticated before starting
  • by default codopt auto-generates and builds a runtime image for the repo, with special handling for common project types like Python, Node, Rust, Go, Java, and Haskell
  • if you override with --docker-image or --dockerfile, the resulting image must contain python3, git, and uv
  • codopt removes the ephemeral images it builds itself after validate and run, so repeated runs do not keep piling up codopt-auto-* images

First-Run Pattern

For a new repo, prefer this sequence:

  1. Wire a benchmark command, test command, and info text or info file.
  2. Run codopt validate ....
  3. If validation fails in the auto-generated image, only then add --dockerfile or --docker-image.
  4. Once validation succeeds, run the full bounded tournament with codopt run ....

Starter scaffolding:

codopt scaffold --output-dir codopt_scaffold

This writes starter benchmark.sh, test.sh, Dockerfile, and INFO.md files you can adapt for a new repo.
