LoopBench — benchmark suite, metrics, submission pipeline, leaderboards

LoopBench

The public scoreboard for loop engineering.

Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.

No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.




pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" loopbench loopgym
loopbench list
loopbench suite list

Run your first score · Live leaderboard · Loop Playground · Leaderboard JSON · Suite overview


LoopBench: install, list tasks, run, validate, rank

What LoopBench measures

You submit a loop specification (LSS YAML). LoopBench:

  1. Runs it through LoopGym on fixed task instances
  2. Computes Success@k and LES_obs across eight categories
  3. Validates your results.json against a published schema
  4. Ranks you on the public leaderboard; generalist (grand composite) is the primary rank

loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench validate results.json
loopbench rank leaderboard/entries.json
loopbench rank leaderboard/entries.json --suite suite-repair
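
The run, validate, rank sequence above can also be driven from a script. The following is a minimal Python sketch, not part of LoopBench: `build_commands` and `run_pipeline` are hypothetical helper names, only the flags shown in the commands above are used, and the sketch assumes the CLI exits with a non-zero status when a step fails.

```python
import subprocess
from typing import List, Sequence


def build_commands(spec: str, suite: str, seeds: Sequence[int],
                   out: str = "results.json",
                   entries: str = "leaderboard/entries.json") -> List[List[str]]:
    """Return the run, validate, and rank commands as argument lists."""
    seed_arg = ",".join(str(s) for s in seeds)  # e.g. "0,1,2,3,4"
    return [
        ["loopbench", "run", "--suite", suite, "--spec", spec,
         "--seeds", seed_arg, "-o", out],
        ["loopbench", "validate", out],
        ["loopbench", "rank", entries, "--suite", suite],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        # Assumption: a failed step returns a non-zero exit status,
        # which check=True turns into a CalledProcessError.
        subprocess.run(cmd, check=True)


cmds = build_commands("your-loop.yaml", "suite-repair", range(5))
for cmd in cmds:
    print(" ".join(cmd))  # print the commands without executing them
```

Calling `run_pipeline(cmds)` would execute the three steps; it requires the `loopbench` CLI on the PATH.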

The measurement stack

flowchart LR
  YOU["Your LSS spec"]
  LB["LoopBench<br/>tasks · scoring · conformance"]
  LG["LoopGym<br/>SimEnv execution"]
  OUT["results.json → leaderboard"]

  YOU --> LB
  LB --> LG
  LG --> LB
  LB --> OUT

| Layer | Owns | Repo |
|---|---|---|
| Spec | LSS schema, LES formulas | Loop Core Engineering |
| Data | Trajectories (holdout v0.2) | LoopNet |
| Runtime | env.run_episode() | LoopGym |
| Observability | LTF traces, iteration metrics | loop-observability |
| Measurement | Tasks, LES_obs, anti-gaming | LoopBench |

LoopBench defines and scores. LoopGym runs. Never the other way around.

New to the stack? Start with the LoopNet end-to-end tutorial.


Suites and tasks (v0.2)

19 micro-tasks feed 4 comparison suites. Primary leaderboard rank = generalist (mean of suite scores).

| Suite ID | Label | Micro-tasks |
|---|---|---|
| suite-repair | Repair & Verify | LB-CR-1, LB-REACT-1, LB-REFLEX-1, LB-OPT-1, LB-SAFE-1 |
| suite-agent | Multi-Agent | LB-MA-1, LB-CREW-1, LB-GRAPH-1, LB-TOT-1, LB-VOTE-1 |
| suite-knowledge | Research & RAG | LB-RS-1, LB-RAG-1, LB-BOOT-1, LB-AUTO-1 |
| suite-rigor | Composition & Safety | LB-COMP-1, LB-NEST-1, LB-SIM-1, LB-HITL-1, LB-MEM-1 |

loopbench suite list
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json

Full catalog in tasks/index.yaml and SUITE-OVERVIEW.md.
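
For scripting against the catalog, the suite membership listed above can be written as a plain mapping. This is an illustrative Python sketch only: the authoritative catalog is tasks/index.yaml, whose format is not shown here, and `SUITES` and `suite_of` are hypothetical names.

```python
from typing import Dict, List

# Suite membership copied from the v0.2 suite listing; not read from tasks/index.yaml.
SUITES: Dict[str, List[str]] = {
    "suite-repair": ["LB-CR-1", "LB-REACT-1", "LB-REFLEX-1", "LB-OPT-1", "LB-SAFE-1"],
    "suite-agent": ["LB-MA-1", "LB-CREW-1", "LB-GRAPH-1", "LB-TOT-1", "LB-VOTE-1"],
    "suite-knowledge": ["LB-RS-1", "LB-RAG-1", "LB-BOOT-1", "LB-AUTO-1"],
    "suite-rigor": ["LB-COMP-1", "LB-NEST-1", "LB-SIM-1", "LB-HITL-1", "LB-MEM-1"],
}


def suite_of(task_id: str) -> str:
    """Return the suite that contains task_id, or raise KeyError if none does."""
    for suite, tasks in SUITES.items():
        if task_id in tasks:
            return suite
    raise KeyError(task_id)


total_tasks = sum(len(tasks) for tasks in SUITES.values())
print(len(SUITES), total_tasks)  # 4 suites, 19 micro-tasks
print(suite_of("LB-RAG-1"))      # suite-knowledge
```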

Live leaderboard

Live board (updated 2026-06-25) — full rankings

Generalist:

  • Loop Engineering maintainer — LES 86.7
  • Loop Engineering maintainer (MA-1) — LES 86.5
  • Team Thorough — LES 86.4

Submit your loop →


Validate and reproduce

After working through REPRODUCE.md, post your 60-minute reproduction report on the reproduction challenge.

Beat maintainer LES (good-first #4)

One command: BEAT_LB-CR-1.md — target LES_obs ≥ 86.7 on LB-CR-1.

Also: BEAT_LB-RS-1.md (81.9) · BEAT_LB-MA-1.md (86.5) · BEAT_LB-COMP-1.md (80.3)

pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" "loopbench>=0.2.0" "loopgym>=0.1.2"
# see BEAT_LB-CR-1.md for full clone + run + submit
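
As a quick self-check before submitting, the four targets quoted above can be compared against your own score. This is a minimal Python sketch with hypothetical names (`BEAT_TARGETS`, `meets_target`); it assumes your score is already on the same 0–100 display scale as the targets and does not reproduce whatever rounding the leaderboard applies.

```python
# Maintainer targets as quoted in the BEAT_*.md references (display scale).
BEAT_TARGETS = {
    "LB-CR-1": 86.7,
    "LB-RS-1": 81.9,
    "LB-MA-1": 86.5,
    "LB-COMP-1": 80.3,
}


def meets_target(task_id: str, les_display: float) -> bool:
    """True if a display-scale LES_obs reaches the quoted target for task_id."""
    return les_display >= BEAT_TARGETS[task_id]


print(meets_target("LB-CR-1", 87.1))  # True
print(meets_target("LB-RS-1", 81.8))  # False
```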

Score in 2 minutes

pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" loopbench loopgym

loopbench suite list

loopbench run \
  --suite suite-repair \
  --spec submissions/examples/spec-fast-loop.yaml \
  --seeds 0,1,2,3,4 \
  -o results.json

loopbench validate results.json
loopbench rank results.json

Submit to the leaderboard: open a PR adding your entry to leaderboard/entries.json.

v0.2 accepts SimEnv submissions (fully reproducible, no API keys). LiveEnv tier is optional.


Metrics explained

| Metric | Meaning |
|---|---|
| Success@k | Fraction of instances reaching goal threshold |
| LES_obs | Observed composite ∈ [0, 1] across eight categories |
| Grand composite | Mean of 4 suite scores (generalist rank) |
| Cost | Estimated USD from LSS cost limits |
| Robustness | Quality retention across seeds |

A 0–100 display scale is optional (les × 100); the leaderboard figures quoted on this page, such as 86.7, use it.
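
The arithmetic behind three of these metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LoopBench's scoring code: the function names are hypothetical, the sample numbers are made up, and because this page does not define k in Success@k, the sketch treats each instance as a single pass/fail outcome against the goal threshold.

```python
from statistics import mean
from typing import Dict, Sequence


def success_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of instances whose score reaches the goal threshold."""
    if not scores:
        raise ValueError("need at least one instance score")
    return sum(s >= threshold for s in scores) / len(scores)


def grand_composite(suite_scores: Dict[str, float]) -> float:
    """Mean of the per-suite scores, each on the [0, 1] scale."""
    return mean(suite_scores.values())


def to_display(les: float) -> float:
    """Convert a [0, 1] score to the optional 0-100 display scale."""
    return les * 100


# Made-up example values, for illustration only.
instance_scores = [0.9, 0.4, 0.8, 0.7, 0.95]
suite_scores = {
    "suite-repair": 0.875,
    "suite-agent": 0.75,
    "suite-knowledge": 0.8125,
    "suite-rigor": 0.6875,
}

print(success_rate(instance_scores, threshold=0.75))  # 0.6
print(grand_composite(suite_scores))                  # 0.78125
print(to_display(grand_composite(suite_scores)))      # 78.125
```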


Who this is for

| You are… | LoopBench gives you… |
|---|---|
| Loop designer | A number you can improve release-over-release |
| Framework author | A neutral arena, not your own benchmark |
| Researcher | Reproducible tasks + published submission schema |
| Team lead | Comparable scores across designs and vendors |

Citation

@software{loopbench2026,
  title={LoopBench: Benchmark Suite for Loop Engineering},
  author={Malpani, Kanak},
  year={2026},
  url={https://pypi.org/project/loopbench/}
}

MIT · v0.2 · Contributing · Security · Status
