OneShot Bench
Scalably converting pair-programming CLI trajectories into challenging digital agent tasks.
Key idea:
- You pair program with Codex in the terminal until you're happy with the result.
- Codex uses MCP tools to save the diff, initial validation criteria (a rubric and fail->pass unit tests), and a record of its successful trajectory to local storage.
- You can then (after optionally tweaking the rubric and unit tests) load that scenario into Modal or Docker and evaluate another instance of Codex, using a different model, to see whether it can headlessly "one-shot" the problem.
- Such data can be shared to and fetched from Hugging Face. See https://huggingface.co/datasets/JoshPurtell/one-shot-bench
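To make the "fail->pass" criterion concrete, here is a minimal sketch of how such tests could be selected. It is not the project's implementation: the function name, the test names, and the dict-of-booleans result format are all hypothetical, and it assumes a test that does not exist before the diff counts as failing.

```python
from typing import Dict, List


def fail_to_pass_tests(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return names of tests that fail before the diff and pass after it.

    Assumption: a test missing from `before` (newly added) counts as failing.
    """
    return sorted(
        name
        for name, passed_after in after.items()
        if passed_after and not before.get(name, False)
    )


# Hypothetical results for the "hello world" README task.
before = {"test_readme_mentions_hello_world": False, "test_existing_behaviour": True}
after = {"test_readme_mentions_hello_world": True, "test_existing_behaviour": True}

print(fail_to_pass_tests(before, after))
```

Only tests in this set discriminate between an agent that solved the task and one that left the repository untouched, which is why they are useful as validation criteria.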
Motivations:
- SWE-bench is over a year old, and we're doing work with agents. This is a simple-ish (please contribute!) approach to creating and curating an evaluation dataset that actually matters for you.
- CLI agents are really powerful and have lots of applications. OSS scaffolding that helps skilled practitioners who already love using them create high-quality data means more data in the ecosystem.
- Selfishly, I hate evaluating agents on SWE-bench and other legacy SWE agent benchmarks. I'm hoping to curate a super high-quality long-horizon agent dataset to really put Synth's algos to the test!!
+hello world
[results] ----------------------------------------
[results] Rubric total score: 54%
[results] - task_completion: 0% (weight=0.4)
[results] - code_quality: 80% (weight=0.3)
[results] - testing: 100% (weight=0.3)
[results] Unit tests: 1 passed, 1 failed
[results] ========================================
[cleanup] Removing container
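The 54% total in the sample output is consistent with a weighted average of the per-criterion scores. The sketch below reproduces that arithmetic. It is an illustration only: the function name is hypothetical, and normalising by the sum of the weights is an assumption rather than something the tool is confirmed to do.

```python
from typing import Dict, Tuple


def rubric_total(criteria: Dict[str, Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs, normalised by total weight."""
    total_weight = sum(weight for _, weight in criteria.values())
    weighted_sum = sum(score * weight for score, weight in criteria.values())
    return weighted_sum / total_weight


# Scores and weights copied from the sample output above.
criteria = {
    "task_completion": (0.0, 0.4),
    "code_quality": (0.8, 0.3),
    "testing": (1.0, 0.3),
}

print(f"Rubric total score: {rubric_total(criteria):.0%}")  # prints "Rubric total score: 54%"
```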
Quick start
- Install codex-synth wrapper
bash scripts/install_codex_synth.sh
- Optional: start local tracing workers and trust CA
uv tool install mitmproxy
bash scripts/start_synth_workers.sh
- One-time: enable MCP tools for task creation
bash scripts/create_tasks/setup_codex_mcp.sh
Inside Codex, ask "What tools do you have?" You should see repo.start_task.v1, repo.end_task.v1, repo.check_readiness.v1, and repo.autofix_readiness.v1.
- Optional: create a task locally. Note: there is a known bug where the MCP tools report failure after successful execution. Ignore it or push a fix :-)
codex-synth
<Hi codex, please update the readme with "hello world". Use the start task tool to begin and end task tool to finish>
- Hello world: run a prepared task locally (Docker)
scripts/run_codex_box.sh data/tasks/prepared/add-lm-tracing-readme 900 50000
Or run a newly created raw task (it will be prepared automatically):
bash scripts/run_codex_box.sh data/tasks/created/update-readme-with-hello-world_20250812_181007
Artifacts and results will appear under data/runs/<run_id>/.
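If you run several tasks in a row, a small helper can locate the most recent run directory. This is a sketch under assumptions: it only relies on each run getting its own directory under data/runs/, it makes no claim about the files inside, and the helper name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(runs_root: Path = Path("data/runs")) -> Optional[Path]:
    """Return the most recently modified run directory, or None if there is none."""
    if not runs_root.is_dir():
        return None
    run_dirs = [p for p in runs_root.iterdir() if p.is_dir()]
    # max() with default=None handles an empty runs directory.
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    run_dir = latest_run_dir()
    print(run_dir if run_dir is not None else "No runs found under data/runs/")
```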
Get started guides
- Setup (install, workers, MCP): guides/setup.md
- Creating a task (Codex MCP, one-shot): guides/creating-a-task.md
- Running tasks sequentially with Docker: guides/docker-sequential.md
- Running tasks in parallel with Modal: guides/modal-parallel.md
- Running on Modal (setup + single-run): guides/modal.md
- Using Hugging Face datasets: guides/huggingface.md
Hugging Face integration
- Upload a prepared task (slim, excludes heavy files):
  - Guide: guides/huggingface_upload.md
  - Command:
uv run python scripts/upload_prepared_task_hf.py \
  data/tasks/prepared/<slug> \
  JoshPurtell/one-shot-bench \
  tasks/<slug> \
  --yes
- Run a prepared task fetched from Hugging Face (Docker):
  - Guide: guides/huggingface_run.md
  - Command:
uv run python scripts/run_hf_task_docker.py \
  --repo-id JoshPurtell/one-shot-bench \
  --task-slug <slug> \
  --model gpt-5-mini
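To inspect a task's files without running it, you could download just that task's folder using the huggingface_hub library directly. This is a rough sketch, not what the project's scripts do: the helper names are hypothetical, and the tasks/<slug>/ layout inside the dataset is an assumption inferred from the tasks/<slug> argument of the upload command.

```python
from pathlib import Path
from typing import List


def task_allow_patterns(slug: str) -> List[str]:
    """Glob patterns selecting a single task folder (assumed layout: tasks/<slug>/)."""
    return [f"tasks/{slug}/*"]


def fetch_task(slug: str, repo_id: str = "JoshPurtell/one-shot-bench") -> Path:
    """Download one task folder from the dataset repo and return its local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_root = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=task_allow_patterns(slug),
    )
    return Path(local_root) / "tasks" / slug
```

Restricting the download with allow_patterns avoids pulling every task in the dataset when you only need one.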
Modal (optional)
- To run Codex in Modal instead of Docker:
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-5-mini
uv tool install modal && modal setup
SANDBOX_BACKEND=modal bash scripts/run_codex_box.sh data/tasks/prepared/<slug>
hello world