LoopBench — benchmark suite, metrics, submission pipeline, leaderboards
LoopBench
The public scoreboard for loop engineering.
Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.
No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.
pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" loopbench loopgym
loopbench list
loopbench suite list
Run your first score · Live leaderboard · Loop Playground · Leaderboard JSON · Suite overview
What LoopBench measures
You submit a loop specification (LSS YAML). LoopBench:
- Runs it through LoopGym on fixed task instances
- Computes Success@k and LES_obs across eight categories
- Validates your results.json against a published schema
- Ranks you on the public leaderboard; generalist (grand composite) is the primary rank
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench validate results.json
loopbench rank leaderboard/entries.json
loopbench rank leaderboard/entries.json --suite suite-repair
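The run and validate commands above can also be driven from a script. The following is a minimal sketch, assuming the `loopbench` CLI is on `PATH` and accepts exactly the flags shown above. The helper names `build_run_command` and `score` are hypothetical and not part of LoopBench.

```python
import subprocess
from typing import Sequence


def build_run_command(suite: str, spec: str, seeds: Sequence[int], out: str) -> list[str]:
    """Assemble the `loopbench run` argv using only the flags shown in this README."""
    return [
        "loopbench", "run",
        "--suite", suite,
        "--spec", spec,
        "--seeds", ",".join(str(s) for s in seeds),
        "-o", out,
    ]


def score(suite: str, spec: str, seeds: Sequence[int], out: str = "results.json") -> None:
    """Run a suite, then validate the results file; raises if either step fails."""
    subprocess.run(build_run_command(suite, spec, seeds, out), check=True)
    subprocess.run(["loopbench", "validate", out], check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without LoopBench installed.
    print(" ".join(build_run_command("suite-repair", "your-loop.yaml", range(5), "results.json")))
```

Passing the argv as a list avoids shell quoting issues with spec paths, and `check=True` stops the script before validation if the run itself fails.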
The measurement stack
flowchart LR
YOU["Your LSS spec"]
LB["LoopBench<br/>tasks · scoring · conformance"]
LG["LoopGym<br/>SimEnv execution"]
OUT["results.json → leaderboard"]
YOU --> LB
LB --> LG
LG --> LB
LB --> OUT
| Layer | Owns | Repo |
|---|---|---|
| Spec | LSS schema, LES formulas | Loop Core Engineering |
| Data | Trajectories (holdout v0.2) | LoopNet |
| Runtime | env.run_episode() | LoopGym |
| Observability | LTF traces, iteration metrics | loop-observability |
| Measurement | Tasks, LES_obs, anti-gaming | LoopBench |
LoopBench defines and scores. LoopGym runs. Never the other way around.
New to the stack? Start with the LoopNet end-to-end tutorial.
Suites and tasks (v0.2)
19 micro-tasks feed 4 comparison suites. Primary leaderboard rank = generalist (mean of suite scores).
| Suite ID | Label | Micro-tasks |
|---|---|---|
| suite-repair | Repair & Verify | LB-CR-1, LB-REACT-1, LB-REFLEX-1, LB-OPT-1, LB-SAFE-1 |
| suite-agent | Multi-Agent | LB-MA-1, LB-CREW-1, LB-GRAPH-1, LB-TOT-1, LB-VOTE-1 |
| suite-knowledge | Research & RAG | LB-RS-1, LB-RAG-1, LB-BOOT-1, LB-AUTO-1 |
| suite-rigor | Composition & Safety | LB-COMP-1, LB-NEST-1, LB-SIM-1, LB-HITL-1, LB-MEM-1 |
loopbench suite list
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
Full catalog in tasks/index.yaml and SUITE-OVERVIEW.md.
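For scripting against the suite layout, the table above can be mirrored as a plain mapping. This is a sketch only: the dictionary is hand-copied from this README, `tasks/index.yaml` remains the source of truth, and the names `SUITES`, `task_count`, and `suite_of` are hypothetical helpers rather than LoopBench APIs.

```python
from typing import Optional

# Hand-copied from the v0.2 suite table; tasks/index.yaml remains the source of truth.
SUITES: dict[str, list[str]] = {
    "suite-repair": ["LB-CR-1", "LB-REACT-1", "LB-REFLEX-1", "LB-OPT-1", "LB-SAFE-1"],
    "suite-agent": ["LB-MA-1", "LB-CREW-1", "LB-GRAPH-1", "LB-TOT-1", "LB-VOTE-1"],
    "suite-knowledge": ["LB-RS-1", "LB-RAG-1", "LB-BOOT-1", "LB-AUTO-1"],
    "suite-rigor": ["LB-COMP-1", "LB-NEST-1", "LB-SIM-1", "LB-HITL-1", "LB-MEM-1"],
}


def task_count(suites: dict[str, list[str]]) -> int:
    """Total number of micro-tasks across all suites."""
    return sum(len(tasks) for tasks in suites.values())


def suite_of(task_id: str, suites: dict[str, list[str]] = SUITES) -> Optional[str]:
    """Return the suite ID that contains task_id, or None if it is not listed."""
    for suite_id, tasks in suites.items():
        if task_id in tasks:
            return suite_id
    return None


if __name__ == "__main__":
    print(task_count(SUITES))    # 19, matching the "19 micro-tasks" figure above
    print(suite_of("LB-RAG-1"))  # suite-knowledge
```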
Live leaderboard
Live board (updated 2026-06-25) — full rankings
Generalist:
- Loop Engineering maintainer — LES 86.7
- Loop Engineering maintainer (MA-1) — LES 86.5
- Team Thorough — LES 86.4
Validate and reproduce
After working through REPRODUCE.md, post your 60-minute reproduction report on the reproduction challenge.
Beat maintainer LES (good-first #4)
One command: BEAT_LB-CR-1.md. Target: LES_obs ≥ 86.7 (0–100 display scale) on LB-CR-1.
Also: BEAT_LB-RS-1.md (81.9) · BEAT_LB-MA-1.md (86.5) · BEAT_LB-COMP-1.md (80.3)
pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" "loopbench>=0.2.0" "loopgym>=0.1.2"
# see BEAT_LB-CR-1.md for full clone + run + submit
Score in 2 minutes
pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" loopbench loopgym
loopbench suite list
loopbench run \
--suite suite-repair \
--spec submissions/examples/spec-fast-loop.yaml \
--seeds 0,1,2,3,4 \
-o results.json
loopbench validate results.json
loopbench rank results.json
Submit to the leaderboard: open a PR adding your entry to leaderboard/entries.json.
v0.2 accepts SimEnv submissions (fully reproducible, no API keys). LiveEnv tier is optional.
Metrics explained
| Metric | Meaning |
|---|---|
| Success@k | Fraction of instances reaching goal threshold |
| LES_obs | Observed composite ∈ [0, 1] — eight categories |
| Grand composite | Mean of 4 suite scores — generalist rank |
| Cost | Estimated USD from LSS cost limits |
| Robustness | Quality retention across seeds |
Display scale 0–100 is optional (les × 100).
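The definitions in the table can be illustrated with a few lines of arithmetic. This is a hedged sketch, not LoopBench's scoring code: the authoritative LES formulas live in the spec layer, the function names are hypothetical, the use of `>=` for the goal threshold is an assumption, and the sample numbers are made up for illustration.

```python
from statistics import fmean
from typing import Sequence


def success_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of instances whose score reaches the goal threshold (assumes >=)."""
    if not scores:
        raise ValueError("need at least one instance score")
    return sum(1 for s in scores if s >= threshold) / len(scores)


def grand_composite(suite_scores: Sequence[float]) -> float:
    """Mean of the per-suite scores; this is the generalist rank value."""
    if not suite_scores:
        raise ValueError("need at least one suite score")
    return fmean(suite_scores)


def to_display(les: float) -> float:
    """Convert a composite in [0, 1] to the optional 0-100 display scale."""
    if not 0.0 <= les <= 1.0:
        raise ValueError("les must lie in [0, 1]")
    return round(les * 100, 1)


if __name__ == "__main__":
    # Illustrative values only, not real benchmark results.
    print(success_rate([0.9, 0.4, 0.8, 0.7], threshold=0.75))  # 0.5
    print(grand_composite([0.5, 0.75, 1.0, 0.25]))             # 0.625
    print(to_display(0.867))                                   # 86.7
```

Under this reading, the leaderboard figures such as 86.7 are display-scale values corresponding to a composite of 0.867.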
Who this is for
| You are… | LoopBench gives you… |
|---|---|
| Loop designer | A number you can improve release-over-release |
| Framework author | A neutral arena — not your own benchmark |
| Researcher | Reproducible tasks + published submission schema |
| Team lead | Comparable scores across designs and vendors |
Citation
@software{loopbench2026,
title={LoopBench: Benchmark Suite for Loop Engineering},
author={Malpani, Kanak},
year={2026},
url={https://pypi.org/project/loopbench/}
}
MIT · v0.2 · Contributing · Security · Status
Download files
File details
Details for the file loopbench-0.2.0.tar.gz.
File metadata
- Download URL: loopbench-0.2.0.tar.gz
- Upload date:
- Size: 136.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f148917ec401fcdb45adfae6dc41a5ef1dbcadae4536907d1cee95ab1a869da5 |
| MD5 | a92f32e323c94fa4c3095ce4e300133c |
| BLAKE2b-256 | 2b22c0151cd3b347dd6469172319c65f2fe5b72c4c49cb4982be113a6e9d5971 |
File details
Details for the file loopbench-0.2.0-py3-none-any.whl.
File metadata
- Download URL: loopbench-0.2.0-py3-none-any.whl
- Upload date:
- Size: 36.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2343b159d1c2383af22c661b9f6832ec38a7aa4b6485cc8df6d90d3e18813df3 |
| MD5 | 73267bdea55a1ea857b7d17bc9b18732 |
| BLAKE2b-256 | 942bedde026239f2872a2930253a897b286cf1226d2bfd3ced777fbfb4f2bc35 |
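A downloaded file can be checked against the SHA256 digests listed above using only the Python standard library. This is a minimal sketch, assuming the file has already been downloaded to a local path; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """True if the file's SHA256 digest equals the published value (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("loopbench-0.2.0.tar.gz", "f148917ec401fcdb45adfae6dc41a5ef1dbcadae4536907d1cee95ab1a869da5")` should return `True` for an intact copy of the source distribution.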