
OneShot Bench

Scalably converting pair-programming CLI trajectories into challenging digital agent tasks.

Key idea:

  • You pair-program with Codex in the terminal until you're happy with the result.
  • Codex uses MCP tools to save the diff, initial validation criteria (a rubric and fail->pass unit tests), and a record of its successful trajectory to local storage.
  • You can then (after optionally tweaking the rubric/unit tests) load that scenario into Modal or Docker and evaluate another instance of Codex, using a different model, to see if it can headlessly "one-shot" the problem.
  • Such data can be shared to and fetched from Hugging Face. See https://huggingface.co/datasets/JoshPurtell/one-shot-bench
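The workflow above boils down to a saved task record. Here is a hypothetical sketch of its shape; every field name (`slug`, `diff`, `rubric`, and so on) is an illustrative assumption, not the MCP tools' actual schema:

```python
import json

# Hypothetical shape of a task record saved by the MCP tools -- all field
# names here are illustrative assumptions, not the tools' actual schema.
task_record = {
    "slug": "update-readme-with-hello-world",
    "diff": "+hello world\n",  # the change made during the pair session
    "rubric": {  # criterion -> weight; weights should sum to 1.0
        "task_completion": 0.4,
        "code_quality": 0.3,
        "testing": 0.3,
    },
    "unit_tests": "fail->pass tests captured at creation time",
    "trajectory": "record of the successful Codex session",
}

print(json.dumps(task_record, indent=2))
```

The rubric weights are the ones that appear in the example evaluation output later in this README.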

Motivations:

  • SWE-bench is over a year old, and we're doing work with agents. This is a simple-ish (please contribute!) approach to creating and curating an evaluation dataset that actually matters for you.
  • CLI agents are really powerful and have lots of applications. OSS scaffolding that helps skilled practitioners who already love these tools produce high-quality data means more data in the ecosystem.
  • Selfishly, I hate evaluating agents on SWE-bench and other legacy SWE agent benchmarks. I'm hoping to curate a super high-quality long-horizon agent dataset to really put Synth's algos to the test!
Example output from an evaluation run (the "+hello world" line is the diff the agent applied):

+hello world
[results] ----------------------------------------
[results] Rubric total score: 54%
[results]  - task_completion: 0% (weight=0.4)
[results]  - code_quality: 80% (weight=0.3)
[results]  - testing: 100% (weight=0.3)
[results] Unit tests: 1 passed, 1 failed
[results] ========================================
[cleanup] Removing container
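The rubric total in the output above is the weighted sum of the per-criterion scores. A minimal check of the arithmetic:

```python
# Weighted rubric total from the example [results] block above:
# 0.4 * 0% + 0.3 * 80% + 0.3 * 100% = 54%
scores = {"task_completion": 0.0, "code_quality": 0.80, "testing": 1.00}
weights = {"task_completion": 0.4, "code_quality": 0.3, "testing": 0.3}

total = sum(scores[k] * weights[k] for k in scores)
print(f"Rubric total score: {total:.0%}")  # Rubric total score: 54%
```

Note that a run can score above 50% while still failing the task outright (task_completion is 0% here), so read the per-criterion lines, not just the total.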

Quick start

  1. Install the codex-synth wrapper:
bash scripts/install_codex_synth.sh
  2. Optional: start local tracing workers and trust the CA:
uv tool install mitmproxy
bash scripts/start_synth_workers.sh

2a) One-time: enable MCP tools for task creation

bash scripts/create_tasks/setup_codex_mcp.sh

Inside Codex, ask: "What tools do you have?" — you should see repo.start_task.v1, repo.end_task.v1, repo.check_readiness.v1, repo.autofix_readiness.v1.
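Codex discovers MCP servers through its config file. Below is a hypothetical sketch of the kind of entry the setup script might add to ~/.codex/config.toml; the server name, command, and module path are all assumptions, not what the script actually writes:

```toml
# Hypothetical entry -- the actual server name and launch command are
# whatever scripts/create_tasks/setup_codex_mcp.sh writes.
[mcp_servers.one_shot_bench]
command = "uv"
args = ["run", "python", "-m", "one_shot_bench.mcp_server"]
```

If the tools don't show up, checking this file is a reasonable first debugging step.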

2.5) Optional: create a task locally. NOTE: there's a known bug where the MCP tools report failure after successful execution. Ignore it or push a fix :-)

codex-synth
<Hi codex, please update the readme with "hello world". Use the start task tool to begin and end task tool to finish>
  3. Hello world: run a prepared task locally (Docker):
bash scripts/run_codex_box.sh data/tasks/prepared/add-lm-tracing-readme 900 50000

Or run a newly created raw task (it will be prepared automatically):

bash scripts/run_codex_box.sh data/tasks/created/update-readme-with-hello-world_20250812_181007 

Artifacts and results will appear under data/runs/<run_id>/.

Getting started guides

  • Setup (install, workers, MCP): guides/setup.md
  • Creating a task (Codex MCP, one-shot): guides/creating-a-task.md
  • Running tasks sequentially with Docker: guides/docker-sequential.md
  • Running tasks in parallel with Modal: guides/modal-parallel.md
  • Running on Modal (setup + single-run): guides/modal.md
  • Using Hugging Face datasets: guides/huggingface.md

Hugging Face integration

  • Upload a prepared task (slim, excludes heavy files):

    • Guide: guides/huggingface_upload.md
    • Command:
      uv run python scripts/upload_prepared_task_hf.py \
        data/tasks/prepared/<slug> \
        JoshPurtell/one-shot-bench \
        tasks/<slug> \
        --yes
      
  • Run a prepared task fetched from Hugging Face (Docker):

    • Guide: guides/huggingface_run.md
    • Command:
      uv run python scripts/run_hf_task_docker.py \
        --repo-id JoshPurtell/one-shot-bench \
        --task-slug <slug> \
        --model gpt-5-mini
      

Modal (optional)

  • To run Codex in Modal instead of Docker:
    export OPENAI_API_KEY=sk-...
    export OPENAI_MODEL=gpt-5-mini
    uv tool install modal && modal setup
    SANDBOX_BACKEND=modal bash scripts/run_codex_box.sh data/tasks/prepared/<slug>
    

