
OneShot Bench

Scalably converting pair-programming CLI trajectories into challenging digital agent tasks.

Key idea:

  • You pair-program with Codex in the terminal until you're happy with the result.
  • Codex uses MCP tools to save the diff, initial validation criteria (a rubric and fail->pass unit tests), and a record of its successful trajectory to local storage.
  • You can then (after optionally tweaking the rubric/unit tests) load that scenario into Modal or Docker and evaluate another instance of Codex, using a different model, to see if it can headlessly "one-shot" the problem.
  • Such data can be shared to and fetched from Hugging Face. See https://huggingface.co/datasets/JoshPurtell/one-shot-bench
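The workflow above boils down to a saved task record. Here is a hypothetical sketch of its shape; every field name (`slug`, `diff`, `rubric`, and so on) is an illustrative assumption, not the MCP tools' actual schema:

```python
import json

# Hypothetical shape of a task record saved by the MCP tools -- all field
# names here are illustrative assumptions, not the tools' actual schema.
task_record = {
    "slug": "update-readme-with-hello-world",
    "diff": "+hello world\n",  # the change made during the pair session
    "rubric": {  # criterion -> weight; weights should sum to 1.0
        "task_completion": 0.4,
        "code_quality": 0.3,
        "testing": 0.3,
    },
    "unit_tests": "fail->pass tests captured at creation time",
    "trajectory": "record of the successful Codex session",
}

print(json.dumps(task_record, indent=2))
```

The rubric weights are the ones that appear in the example evaluation output later in this README.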

Motivations:

  • SWE-bench is over a year old, and we're doing work with agents. This is a simple-ish (please contribute!) approach to creating and curating an evaluation dataset that actually matters for you.
  • CLI agents are really powerful and have lots of applications. OSS scaffolding that helps skilled practitioners who already love these tools produce high-quality data means more data in the ecosystem.
  • Selfishly, I hate evaluating agents on SWE-bench and other legacy SWE agent benchmarks. I'm hoping to curate a super high-quality long-horizon agent dataset to really put Synth's algos to the test!
Example output from an evaluation run (the "+hello world" line is the diff the agent applied):

+hello world
[results] ----------------------------------------
[results] Rubric total score: 54%
[results]  - task_completion: 0% (weight=0.4)
[results]  - code_quality: 80% (weight=0.3)
[results]  - testing: 100% (weight=0.3)
[results] Unit tests: 1 passed, 1 failed
[results] ========================================
[cleanup] Removing container
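The rubric total in the output above is the weighted sum of the per-criterion scores. A minimal check of the arithmetic:

```python
# Weighted rubric total from the example [results] block above:
# 0.4 * 0% + 0.3 * 80% + 0.3 * 100% = 54%
scores = {"task_completion": 0.0, "code_quality": 0.80, "testing": 1.00}
weights = {"task_completion": 0.4, "code_quality": 0.3, "testing": 0.3}

total = sum(scores[k] * weights[k] for k in scores)
print(f"Rubric total score: {total:.0%}")  # Rubric total score: 54%
```

Note that a run can score above 50% while still failing the task outright (task_completion is 0% here), so read the per-criterion lines, not just the total.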

Quick start

  1. Install the codex-synth wrapper:
bash scripts/install_codex_synth.sh
  2. Optional: start local tracing workers and trust the CA:
uv tool install mitmproxy
bash scripts/start_synth_workers.sh

2a) One-time: enable MCP tools for task creation

bash scripts/create_tasks/setup_codex_mcp.sh

Inside Codex, ask: "What tools do you have?" — you should see repo.start_task.v1, repo.end_task.v1, repo.check_readiness.v1, repo.autofix_readiness.v1.
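Codex discovers MCP servers through its config file. Below is a hypothetical sketch of the kind of entry the setup script might add to ~/.codex/config.toml; the server name, command, and module path are all assumptions, not what the script actually writes:

```toml
# Hypothetical entry -- the actual server name and launch command are
# whatever scripts/create_tasks/setup_codex_mcp.sh writes.
[mcp_servers.one_shot_bench]
command = "uv"
args = ["run", "python", "-m", "one_shot_bench.mcp_server"]
```

If the tools don't show up, checking this file is a reasonable first debugging step.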

2.5) Optional: create a task locally. NOTE: there's a known bug where the MCP tools report failure after successful execution. Ignore it or push a fix :-)

codex-synth
<Hi codex, please update the readme with "hello world". Use the start task tool to begin and end task tool to finish>
  3. Hello world: run a prepared task locally (Docker):
bash scripts/run_codex_box.sh data/tasks/prepared/add-lm-tracing-readme 900 50000

Or run a newly created raw task (it will be prepared automatically):

bash scripts/run_codex_box.sh data/tasks/created/update-readme-with-hello-world_20250812_181007 

Artifacts and results will appear under data/runs/<run_id>/.

Getting started guides

  • Setup (install, workers, MCP): guides/setup.md
  • Creating a task (Codex MCP, one-shot): guides/creating-a-task.md
  • Running tasks sequentially with Docker: guides/docker-sequential.md
  • Running tasks in parallel with Modal: guides/modal-parallel.md
  • Running on Modal (setup + single-run): guides/modal.md
  • Using Hugging Face datasets: guides/huggingface.md

Hugging Face integration

  • Upload a prepared task (slim, excludes heavy files):

    • Guide: guides/huggingface_upload.md
    • Command:
      uv run python scripts/upload_prepared_task_hf.py \
        data/tasks/prepared/<slug> \
        JoshPurtell/one-shot-bench \
        tasks/<slug> \
        --yes
      
  • Run a prepared task fetched from Hugging Face (Docker):

    • Guide: guides/huggingface_run.md
    • Command:
      uv run python scripts/run_hf_task_docker.py \
        --repo-id JoshPurtell/one-shot-bench \
        --task-slug <slug> \
        --model gpt-5-mini
      

Modal (optional)

  • To run Codex in Modal instead of Docker:
    export OPENAI_API_KEY=sk-...
    export OPENAI_MODEL=gpt-5-mini
    uv tool install modal && modal setup
    SANDBOX_BACKEND=modal bash scripts/run_codex_box.sh data/tasks/prepared/<slug>
    

