BenchLoop
Benchmark local LLMs by what actually matters.
BenchLoop is a local-first CLI + web app for benchmarking LLMs running on your own hardware. It scores models across seven repeatable suites (speed, tool calling, coding, data extraction, instruction following, reasoning/math, and a multi-turn agent loop) and gives you receipts: per-task outputs, latency, token counts, machine info, scores.
No accounts, no telemetry, no API keys. Your model, your machine, your numbers.
$ benchloop run --model qwen3:8b --suites speed,toolcall,agent
... 8 tasks, 4 tools, 6 turns avg, 74.6 tok/s ...
Overall 73.4 ████████░░
Quality 73.6 ████████░░
Speed 78.9 █████████░
Agent 96.9 █████████▌
Published runs live at https://bench-loop.com/leaderboard. Every completed local benchmark auto-publishes there.
Why
Hosted LLM leaderboards answer "which model wins on a server farm someone else paid for?" BenchLoop answers "which model + harness + hardware combination actually works for me right now?" — the question you have when picking a local stack.
It is repeatable on purpose: every run persists to disk, the task set is frozen, the scorer is deterministic. If you say "qwen3:8b scored 89 on my 4090", anyone can install BenchLoop and verify it.
Install
pipx (recommended)
pipx install benchloop-cli
benchloop --version
The PyPI distribution is named `benchloop-cli` (the bare `benchloop` name was taken by an unrelated dataset library). The installed commands are still `benchloop` and `bench-loop`.
pip
pip install benchloop-cli
From source
git clone https://github.com/outsourc-e/bench-loop
cd bench-loop
pip install -e .
Run your first benchmark
Make sure you have a local LLM endpoint running. Anything OpenAI-compatible or Ollama-flavored works:
- Ollama at `http://localhost:11434` (default)
- LM Studio at `http://localhost:1234` (`--provider openai_compat`)
- MLX / Osaurus at `http://localhost:8000` (`--provider openai_compat`)
- vLLM, Jan, llama-server, etc.
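If you're not sure your endpoint is up, a quick probe helps. A minimal sketch: `/v1/models` is the standard OpenAI-compatible listing route, and Ollama exposes `/api/tags` instead; the ports are the defaults listed above.

```python
import requests

# Probe common local endpoints before benchmarking. /v1/models is the
# OpenAI-compatible listing route; Ollama uses /api/tags instead.
for url in ("http://localhost:1234/v1/models",   # LM Studio
            "http://localhost:8000/v1/models",   # MLX / Osaurus / vLLM
            "http://localhost:11434/api/tags"):  # Ollama
    try:
        status = requests.get(url, timeout=2).status_code
        print(f"{url} -> {status}")
    except requests.ConnectionError:
        print(f"{url} -> not running")
```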
Then:
benchloop run \
--model qwen3:8b \
--endpoint http://localhost:11434 \
--provider ollama
This runs every default suite, scores them, prints a console report, and persists the full run to ~/.bench-loop/runs/.
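Persisted runs can be inspected directly. A minimal sketch, assuming each run lands as a JSON document under that directory; the exact file layout and field names here are assumptions, not a documented schema:

```python
import json
from pathlib import Path

# Assumes one JSON file per run; the field names are illustrative.
runs_dir = Path.home() / ".bench-loop" / "runs"
for path in sorted(runs_dir.glob("*.json")):
    run = json.loads(path.read_text())
    print(path.name, run.get("model"), run.get("overall"))
```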
Run a subset
benchloop run --model qwen3:8b --suites speed,agent
Different prompting harness
Same model, four ways to talk to it:
benchloop run --model qwen3:8b --harness raw # native tool calling
benchloop run --model qwen3:8b --harness hermes # <tool_call>{...}</tool_call>
benchloop run --model qwen3:8b --harness qwen # <function_call>{...}</function_call>
benchloop run --model qwen3:8b --harness pi # <think>...</think> + Hermes tags
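The tag-based harnesses exist for models without native tool-call support: the harness puts tool schemas in the prompt and parses tagged JSON back out of plain text. A minimal sketch of that parsing step for Hermes-style tags (BenchLoop's actual parser may be more forgiving):

```python
import json
import re

# Extract Hermes-style tool calls from plain model text. The qwen
# harness would do the same with <function_call> tags.
TAG = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    return [json.loads(body) for body in TAG.findall(text)]

out = '<tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>'
print(parse_tool_calls(out))  # [{'name': 'get_weather', 'arguments': {'city': 'Oslo'}}]
```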
Suites
| Suite | What it scores |
|---|---|
| `speed` | Latency, throughput, TTFT, generation tok/s across short/medium/long contexts |
| `toolcall` | Structured tool-call correctness across realistic tasks (weather, stocks, email, search) |
| `coding` | Executable Python tasks verified in a sandboxed subprocess (10s timeout) |
| `dataextract` | JSON / structured extraction from messy natural language |
| `instructfollow` | Constraint following, formatting, exactness |
| `reasonmath` | Small reasoning + math tasks with deterministic checks |
| `agent` | Multi-turn agentic tool use. BenchLoop drives a real loop: the model emits a tool call, BenchLoop executes it locally, feeds the result back, and the model iterates until done. Scores correctness, efficiency, no-hallucination, required-tool coverage. |
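For context, the agent suite's loop is the standard tool-calling pattern. A minimal sketch against an OpenAI-compatible endpoint; the endpoint, model, `get_weather` tool, and executor here are illustrative stand-ins, not BenchLoop's actual fixtures:

```python
import json
import requests

# One illustrative tool schema; BenchLoop's task fixtures define their own.
TOOLS = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Current weather for a city.",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}]

def run_tool(name: str, args: dict) -> dict:
    if name == "get_weather":                 # executed locally, no network
        return {"city": args["city"], "temp_c": 18}
    raise ValueError(f"hallucinated tool: {name}")

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
for _ in range(8):                            # hard cap on turns
    resp = requests.post("http://localhost:1234/v1/chat/completions",
                         json={"model": "qwen3:8b", "messages": messages,
                               "tools": TOOLS}).json()
    msg = resp["choices"][0]["message"]
    messages.append(msg)
    if not msg.get("tool_calls"):             # plain answer: the loop is done
        print(msg["content"])
        break
    for call in msg["tool_calls"]:            # execute and feed results back
        result = run_tool(call["function"]["name"],
                          json.loads(call["function"]["arguments"]))
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": json.dumps(result)})
```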
Scoring
Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability
- Quality = mean of non-speed suite scores (size-fair).
- Speed = `12.54 · log2(tok/s) + 0.9`, clamped to 0–100.
- Reliability = pass rate across all tasks.
- Agent = `correct_final + efficient + no_hallucinated_tools + all_required_called`, 25 pts each, averaged across tasks.
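In code, the formulas above reduce to a few lines (a sketch reproducing the published weights, not the internal scorer):

```python
import math

def speed_score(tok_per_s: float) -> float:
    # 12.54 · log2(tok/s) + 0.9, clamped to 0-100
    return max(0.0, min(100.0, 12.54 * math.log2(tok_per_s) + 0.9))

def overall(quality: float, speed: float, reliability: float) -> float:
    # Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability
    return 0.55 * quality + 0.20 * speed + 0.25 * reliability

def agent_task_score(correct_final: bool, efficient: bool,
                     no_hallucinated_tools: bool,
                     all_required_called: bool) -> int:
    # Four 25-point components per task, averaged across tasks for the suite.
    return 25 * sum([correct_final, efficient,
                     no_hallucinated_tools, all_required_called])

print(round(speed_score(74.6), 1))  # 78.9, matching the sample run above
```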
Local web app
A FastAPI backend + React frontend bundle ships alongside the CLI for visualizing runs:
benchloop dashboard # starts the local web app on :5180
Tabs: Models, Benchmark, Leaderboard, Compare runs, Chat, and an agent trace viewer.
Publish a run
Every completed benchmark auto-publishes to https://bench-loop.com/leaderboard via https://api.bench-loop.com/submit. Runs are deduped by (machine_id, run_id) so the same run from the same machine won't be double-counted.
Opt out:
export BENCHLOOP_NO_SUBMIT=1
You can still manually export a snapshot for sharing / archiving:
benchloop export --output my-runs.json
Architecture
bench-loop/ ← this repo, the CLI + suites + scorers
bench_loop/
cli.py ← `benchloop` entrypoint
suites/ ← speed, toolcall, coding, agent, ...
harness.py ← raw / hermes / qwen / pi adapters
providers/ ← ollama, openai_compat
runner/orchestrator.py ← drives suites + harnesses
tasks/ ← frozen task YAML fixtures
bench-loop-web/ ← the web app (separate repo)
api/ ← FastAPI wrapper around bench_loop
ui/ ← local dashboard
site/ ← public benchloop.com static site
Status
BenchLoop is v0.1 beta. The benchmark surface, scoring, web app, agent loop, and four harnesses all work end-to-end. Stuff still on the roadmap:
- Streaming TTFT for OpenAI-compatible providers (currently reported as 0 on those backends; Ollama TTFT is fine)
- Bigger task fixtures (each suite is intentionally small and frozen for v1)
- Hosted submission flow for community runs
- More provider adapters (TGI, Bedrock, etc. if there's demand)
License
MIT. See LICENSE.