LoopBench — benchmark suite, metrics, submission pipeline, leaderboards

LoopBench

The public scoreboard for loop engineering.

Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.

No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.

pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" loopbench loopgym
loopbench list
loopbench suite list

Run your first score · Live leaderboard · Loop Playground · Leaderboard JSON · Suite overview

What LoopBench measures

You submit a loop specification (LSS YAML). LoopBench:

Runs it through LoopGym on fixed task instances
Computes Success@k and LES_obs across eight categories
Validates your results.json against a published schema
Ranks you on the public leaderboard — generalist (grand composite) is the primary rank

loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench validate results.json
loopbench rank leaderboard/entries.json
loopbench rank leaderboard/entries.json --suite suite-repair

The measurement stack

flowchart LR
  YOU["Your LSS spec"]
  LB["LoopBench<br/>tasks · scoring · conformance"]
  LG["LoopGym<br/>SimEnv execution"]
  OUT["results.json → leaderboard"]

  YOU --> LB
  LB --> LG
  LG --> LB
  LB --> OUT

Layer	Owns	Repo
Spec	LSS schema, LES formulas	Loop Core Engineering
Data	Trajectories (holdout v0.2)	LoopNet
Runtime	`env.run_episode()`	LoopGym
Observability	LTF traces, iteration metrics	loop-observability
Measurement	Tasks, LES_obs, anti-gaming	LoopBench

LoopBench defines and scores. LoopGym runs. Never the other way around.

New to the stack? Start with the LoopNet end-to-end tutorial.

Suites and tasks (v0.2)

19 micro-tasks feed 4 comparison suites. The primary leaderboard rank is the generalist score: the mean of the four suite scores.

Suite ID	Label	Micro-tasks
`suite-repair`	Repair & Verify	LB-CR-1, LB-REACT-1, LB-REFLEX-1, LB-OPT-1, LB-SAFE-1
`suite-agent`	Multi-Agent	LB-MA-1, LB-CREW-1, LB-GRAPH-1, LB-TOT-1, LB-VOTE-1
`suite-knowledge`	Research & RAG	LB-RS-1, LB-RAG-1, LB-BOOT-1, LB-AUTO-1
`suite-rigor`	Composition & Safety	LB-COMP-1, LB-NEST-1, LB-SIM-1, LB-HITL-1, LB-MEM-1
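As a rough illustration of how the generalist score relates to the table above, the sketch below copies the suite-to-task mapping from the table and averages four suite scores. It assumes an unweighted mean; the suite scores are made-up placeholder values, and `grand_composite` is a hypothetical helper name, not part of the LoopBench API.

```python
from statistics import mean

# Suite membership copied from the table above.
SUITES = {
    "suite-repair": ["LB-CR-1", "LB-REACT-1", "LB-REFLEX-1", "LB-OPT-1", "LB-SAFE-1"],
    "suite-agent": ["LB-MA-1", "LB-CREW-1", "LB-GRAPH-1", "LB-TOT-1", "LB-VOTE-1"],
    "suite-knowledge": ["LB-RS-1", "LB-RAG-1", "LB-BOOT-1", "LB-AUTO-1"],
    "suite-rigor": ["LB-COMP-1", "LB-NEST-1", "LB-SIM-1", "LB-HITL-1", "LB-MEM-1"],
}


def grand_composite(suite_scores):
    """Hypothetical helper: unweighted mean of the per-suite scores."""
    missing = set(SUITES) - set(suite_scores)
    if missing:
        raise ValueError(f"missing suite scores: {sorted(missing)}")
    return mean(suite_scores[s] for s in SUITES)


total_tasks = sum(len(tasks) for tasks in SUITES.values())

# Placeholder scores on the [0, 1] scale, for illustration only.
example_scores = {
    "suite-repair": 0.80,
    "suite-agent": 0.70,
    "suite-knowledge": 0.60,
    "suite-rigor": 0.90,
}
print(total_tasks, round(grand_composite(example_scores), 3))
```

Requiring all four suites before averaging reflects the idea that a generalist rank only makes sense for a spec that has been scored everywhere; whether LoopBench enforces this is not stated here.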

loopbench suite list
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json

Full catalog in tasks/index.yaml and SUITE-OVERVIEW.md.

Live leaderboard

Live board (updated 2026-06-25) — full rankings

Generalist:

Loop Engineering maintainer — LES 86.7
Loop Engineering maintainer (MA-1) — LES 86.5
Team Thorough — LES 86.4
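The ordering shown above can be reproduced with a plain sort. This is only a sketch: the field names `name` and `les` are assumptions made for illustration, since the actual layout of leaderboard/entries.json is defined by the published schema and is not shown here, and `rank_entries` is a hypothetical helper rather than the `loopbench rank` implementation.

```python
# Entries copied from the generalist board above; the field names
# "name" and "les" are assumptions, not the published schema.
entries = [
    {"name": "Team Thorough", "les": 86.4},
    {"name": "Loop Engineering maintainer", "les": 86.7},
    {"name": "Loop Engineering maintainer (MA-1)", "les": 86.5},
]


def rank_entries(entries):
    """Return (rank, name, les) tuples, highest LES first."""
    ordered = sorted(entries, key=lambda e: e["les"], reverse=True)
    return [(i, e["name"], e["les"]) for i, e in enumerate(ordered, start=1)]


for rank, name, les in rank_entries(entries):
    print(f"{rank}. {name}: LES {les}")
```

Tie-breaking is left unspecified in this sketch; entries with equal LES simply keep their input order.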

Submit your loop →

Validate and reproduce

After working through REPRODUCE.md, post your 60-minute reproduction report on the reproduction challenge.

Beat maintainer LES (good-first #4)

One command: BEAT_LB-CR-1.md — target LES_obs ≥ 86.7 on LB-CR-1.

Also: BEAT_LB-RS-1.md (81.9) · BEAT_LB-MA-1.md (86.5) · BEAT_LB-COMP-1.md (80.3)

pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" "loopbench>=0.2.0" "loopgym>=0.1.2"
# see BEAT_LB-CR-1.md for full clone + run + submit

Score in 2 minutes

pip install "le-loopforge>=0.2.0" "le-loopctl>=0.1.0" loopbench loopgym

loopbench suite list

loopbench run \
  --suite suite-repair \
  --spec submissions/examples/spec-fast-loop.yaml \
  --seeds 0,1,2,3,4 \
  -o results.json

loopbench validate results.json
loopbench rank results.json
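If you want to drive the same run from a script (for example, to sweep several specs), one possible approach is to assemble the command line in Python. The sketch below uses only flags that appear in this README (`--suite`, `--task`, `--spec`, `--seeds`, `-o`); `build_run_command` is a hypothetical helper, and treating `--suite` and `--task` as mutually exclusive is an assumption.

```python
import shlex


def build_run_command(spec, seeds, output="results.json", suite=None, task=None):
    """Assemble argv for `loopbench run` using only flags shown in this README."""
    # Assumption: a run targets either one suite or one task, never both.
    if (suite is None) == (task is None):
        raise ValueError("pass exactly one of suite or task")
    target = ["--suite", suite] if suite else ["--task", task]
    seed_arg = ",".join(str(s) for s in seeds)
    return ["loopbench", "run", *target, "--spec", spec, "--seeds", seed_arg, "-o", output]


cmd = build_run_command(
    spec="submissions/examples/spec-fast-loop.yaml",
    seeds=range(5),
    suite="suite-repair",
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True), then run
# ["loopbench", "validate", "results.json"] the same way.
```

Passing the arguments as a list avoids shell quoting problems if a spec path contains spaces.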

Submit to the leaderboard: open a PR adding your entry to leaderboard/entries.json.

v0.2 accepts SimEnv submissions (fully reproducible, no API keys). LiveEnv tier is optional.

Metrics explained

Metric	Meaning
Success@k	Fraction of instances reaching goal threshold
LES_obs	Observed composite ∈ `[0, 1]` — eight categories
Grand composite	Mean of 4 suite scores — generalist rank
Cost	Estimated USD from LSS cost limits
Robustness	Quality retention across seeds

Scores may optionally be displayed on a 0–100 scale (LES × 100); leaderboard figures such as 86.7 use this display scale.
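To make two of these definitions concrete, here is a minimal sketch of Success@k as a threshold fraction and of the display-scale conversion. The per-instance scores and the goal threshold are hypothetical, the `>=` comparison and one-decimal rounding are assumptions, and the sketch does not model how `k` enters the metric because that is not specified here. Neither function is part of the LoopBench API.

```python
def success_at_k(instance_scores, goal_threshold):
    """Fraction of instances whose score reaches the goal threshold."""
    if not instance_scores:
        raise ValueError("need at least one instance")
    hits = sum(1 for s in instance_scores if s >= goal_threshold)
    return hits / len(instance_scores)


def to_display(les):
    """Convert LES in [0, 1] to the optional 0-100 display scale."""
    if not 0.0 <= les <= 1.0:
        raise ValueError("LES must lie in [0, 1]")
    return round(les * 100, 1)


scores = [0.9, 0.4, 0.75, 0.8, 0.6]  # hypothetical per-instance scores
print(success_at_k(scores, goal_threshold=0.75))
print(to_display(0.867))
```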

Who this is for

You are…	LoopBench gives you…
Loop designer	A number you can improve release-over-release
Framework author	A neutral arena — not your own benchmark
Researcher	Reproducible tasks + published submission schema
Team lead	Comparable scores across designs and vendors

Citation

@software{loopbench2026,
  title={LoopBench: Benchmark Suite for Loop Engineering},
  author={Malpani, Kanak},
  year={2026},
  url={https://pypi.org/project/loopbench/}
}

MIT · v0.2 · Contributing · Security · Status

Release history

0.2.0 (this version): Jun 30, 2026
0.1.1: Jun 24, 2026
0.1.0: Jun 13, 2026

