Optimize a software metric with Codex, git worktrees, and Docker
codex-optimize
https://github.com/user-attachments/assets/7646dab7-d12a-4574-a493-9d130e9042e9
Optimize any software with the Codex SDK.
codopt clones your repository into a run directory, fans out candidate branches with git worktrees, runs one Codex agent per branch in its own Docker container, and evaluates each branch with a benchmark command plus a correctness test command. Surviving branches fork again in later rounds.
By default, codopt snapshots your current working tree into a disposable internal repo first, so local tracked edits are part of the optimization baseline even if they are not committed yet.
Why?
One approach to AI-assisted software optimization is to point an agent at some code and tell it to optimize it. This has several problems:
- Agents tend to cheat benchmarks, even unintentionally. When you tell an agent to maximize a value without constraints, a common behavior pattern is that it hacks the benchmarks and tests to produce a result that looks great but, on closer inspection, is not a substantive optimization.
- Agents are nondeterministic, so an agent can fail at the optimization one time and succeed the next, even with the same prompt.
- Agents can get lazy! This is unintuitive, but an agent prompted to "optimize" often concludes it is done once it thinks it has provided an answer. After it states that it is done, that statement sits in its context, so it keeps believing it. In a sense, it has poisoned its own context.
codex-optimize attempts to solve these problems:
- codopt explicitly partitions the source code, optimization tests, and correctness tests. Because these parts are partitioned and tracked in git, they can be reset to evaluate whether the source code changes were substantive, which prevents the benchmark-hacking behavior.
- By running a beam search strategy, we can see a diverse variety of attempts and keep exploring the ones that work. The example run below shows this: some of the Codex agents actually degraded the quality of the optimization, but the top candidates significantly optimized the code.
- By pruning nodes that are failing or stagnating, we can avoid context poisoning and get results over more iterations. This is also demonstrated in the example below, where after some iterations some nodes fail while others keep improving.
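The branch, evaluate, and prune loop described above can be illustrated with a toy sketch. This is not codopt's implementation: `Node`, `propose_child`, `beam_search`, and the `beam_width` parameter are hypothetical names, and a random score change stands in for a real agent run plus benchmark.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    """One candidate branch; `score` stands in for the benchmark metric."""
    name: str
    score: float
    passed_tests: bool = True


def propose_child(parent: Node, index: int, rng: random.Random) -> Node:
    """Hypothetical stand-in for one agent run: it may improve or degrade the score."""
    return Node(
        name=f"{parent.name}.{index}",
        score=parent.score + rng.uniform(-1.0, 2.0),
        passed_tests=rng.random() > 0.2,  # some attempts break correctness
    )


def beam_search(root: Node, branch: int, beam_width: int, rounds: int, seed: int = 0) -> Node:
    """Fan out `branch` children per survivor, prune, and keep the top `beam_width`."""
    rng = random.Random(seed)
    survivors = [root]
    best = root
    for _ in range(rounds):
        candidates = []
        for parent in survivors:
            for i in range(branch):
                child = propose_child(parent, i, rng)
                # Prune failing children and children that did not beat their parent.
                if child.passed_tests and child.score > parent.score:
                    candidates.append(child)
        if not candidates:
            break  # every branch failed or stagnated
        candidates.sort(key=lambda node: node.score, reverse=True)
        survivors = candidates[:beam_width]
        if survivors[0].score > best.score:
            best = survivors[0]
    return best


best = beam_search(Node("root", 0.0), branch=3, beam_width=2, rounds=2)
print(best.name, round(best.score, 3))
```

Because each child starts from its parent's state rather than from a long shared conversation, a branch that stalls is simply dropped instead of carrying its "I am done" conclusion forward.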
The core idea is to use the Codex SDK to optimize more deterministically than using Skills or prompting.
Quick Start
example/life contains a Conway's Game of Life challenge chosen to be optimizable but not one-shottable.
Install the CLI locally for testing:
uv tool install /path/to/codex-optimize
View the result of my run in the UI:
codopt ui --run-root example/life_result/run
Alternatively you can run it yourself.
Run:
codopt run \
  --edit example/life/life.py \
  --metric example/life/metric.json \
  --metric-key score \
  --command "python3 example/life/benchmark.py" \
  --test "python3 example/life/tests.py" \
  --info example/life/INFO.md \
  --docker-image codopt-life:latest \
  --branch 3 \
  --time 120 \
  --max-agents 6 \
  --rounds 2
Read more about this run in the result's README.MD.
Instead of running the program yourself, you can ask your agent to use it!
To do that, copy the optimize skill folder into ~/.codex/skills/optimize and restart Codex.
Here is a demo video of Codex using the codopt skill to achieve a 33% improvement in tokens per second for LLM inference.
https://github.com/user-attachments/assets/f34ac402-c19c-4ced-9215-5ff9f2a0e889
Read more about that here or view the repo codopt created here.
CLI Flags
- --edit: repeatable file or directory the agent may edit
- --metric: metric file written by the benchmark command
- --metric-key: JSON key to read when the metric file is JSON, default score
- --lower-is-better: invert the parsed metric value for ranking
- --command: benchmark command
- --command-file: path to a shell snippet file executed with sh -eu; repo-local files run from the cloned repo path, external files are copied into the run root
- --branch: children per surviving node
- --time: per-node Codex time budget in seconds
- --info / --info-file: background context file given to the agent, may be outside the repo
- --info-text: inline background context for the agent
- --max-agents: active-node cap used to decide survivor count
- --test: correctness test command
- --test-file: path to a shell snippet file executed with sh -eu; repo-local files run from the cloned repo path, external files are copied into the run root
- --docker-image: optional prebuilt container image for agent and evaluation runs
- --dockerfile: optional Dockerfile to build and use for agent and evaluation runs
- --source-mode: working-tree (default) snapshots the current repo state; head uses Git HEAD only
- --rounds: tournament depth
- --allow-path: repeatable extra writable path
- --keep-worktrees: keep worktree directories after completion
Metric Key
Your benchmark command does not need to match the Life example, but it does need to produce one metric file that codopt can parse:
- if the metric file is plain text, it must contain a single numeric value
- if the metric file is JSON, codopt reads one numeric field from it
- by default that JSON field is score unless --metric-key is passed
- by default higher values are treated as better unless --lower-is-better is passed
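As a rough illustration of the rules above, the sketch below writes two metric files and reads them back. The `read_metric` helper is hypothetical and is not codopt's parser; in particular, treating "invert" as negation is an assumption made only for this example.

```python
import json
import tempfile
from pathlib import Path


def read_metric(path, metric_key="score", lower_is_better=False):
    """Return a value where higher always means better (illustrative only)."""
    text = Path(path).read_text().strip()
    try:
        # Plain-text metric file: a single numeric value.
        value = float(text)
    except ValueError:
        # JSON metric file: read one numeric field by key.
        value = float(json.loads(text)[metric_key])
    # Assumption: "invert" is modeled here as negation so ranking stays descending.
    return -value if lower_is_better else value


with tempfile.TemporaryDirectory() as tmp:
    # What a benchmark command might write as a JSON metric file.
    json_path = Path(tmp) / "metric.json"
    json_path.write_text(json.dumps({"score": 12.5, "runs": 3}))

    # What a benchmark command might write as a plain-text metric file.
    text_path = Path(tmp) / "metric.txt"
    text_path.write_text("0.25\n")

    json_value = read_metric(json_path)
    text_value = read_metric(text_path, lower_is_better=True)

print(json_value, text_value)
```

Whatever the format, the benchmark should write exactly one metric file per run so that every branch is ranked on the same number.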
Requirements
Before running codopt, you need:
- git
- docker
- uv
- Python 3 on the host
- an existing Codex login on the host in ~/.codex
Important setup notes:
- run codopt from the root of the Git repo you want to optimize
- Docker must be running
- codopt seeds a run-local CODEX_HOME from your host ~/.codex, so you need to already be authenticated before starting
- by default codopt auto-generates and builds a runtime image for the repo, with special handling for common project types like Python, Node, Rust, Go, Java, and Haskell
- if you override with --docker-image or --dockerfile, the resulting image must contain python3, git, and uv
- codopt removes the ephemeral images it builds itself after validate and run, so repeated runs do not keep piling up codopt-auto-* images
First-Run Pattern
For a new repo, prefer this sequence:
- Wire a benchmark command, test command, and info text or info file.
- Run codopt validate ....
- If validation fails in the auto-generated image, only then add --dockerfile or --docker-image.
- Once validation succeeds, run the full bounded tournament with codopt run ....
Starter scaffolding:
codopt scaffold --output-dir codopt_scaffold
This writes starter benchmark.sh, test.sh, Dockerfile, and INFO.md files you can adapt for a new repo.