Local-first CLI + web dashboard for benchmarking LLMs across quality, speed, reliability, and a real multi-turn agent loop. Hardware-aware, deterministic, reproducible.

BenchLoop

Benchmark local LLMs by what actually matters.

BenchLoop is a local-first CLI + web app for benchmarking LLMs running on your own hardware. It scores models across seven repeatable suites (speed, tool calling, coding, data extraction, instruction following, reasoning and math, and multi-turn agentic tool use) and gives you receipts: per-task outputs, latency, token counts, machine info, scores.

No accounts, no telemetry, no API keys. Your model, your machine, your numbers.

$ benchloop run --model qwen3:8b --suites speed,toolcall,agent
... 8 tasks, 4 tools, 6 turns avg, 74.6 tok/s ...

Overall  73.4  ████████░░
Quality  73.6  ████████░░
Speed    78.9  █████████░
Agent    96.9  █████████▌

Published runs live at https://bench-loop.com/leaderboard. Every completed local benchmark auto-publishes there unless you opt out by setting BENCHLOOP_NO_SUBMIT=1 (see "Publish a run").

Why

Hosted LLM leaderboards answer "which model wins on a server farm someone else paid for?" BenchLoop answers "which model + harness + hardware combination actually works for me right now?" — the question you have when picking a local stack.

It is repeatable on purpose: every run persists to disk, the task set is frozen, the scorer is deterministic. If you say "qwen3:8b scored 89 on my 4090", anyone can install BenchLoop and verify it.

Install

pipx (recommended)

pipx install benchloop-cli
benchloop --version

The PyPI distribution is named benchloop-cli (the bare benchloop name was taken by an unrelated dataset library). The installed commands are still benchloop and bench-loop.

pip

pip install benchloop-cli

From source

git clone https://github.com/outsourc-e/bench-loop
cd bench-loop
pip install -e .

Run your first benchmark

Make sure you have a local LLM endpoint running. Anything OpenAI-compatible or Ollama-flavored works:

  • Ollama at http://localhost:11434 (default)
  • LM Studio at http://localhost:1234 (--provider openai_compat)
  • MLX / Osaurus at http://localhost:8000 (--provider openai_compat)
  • vLLM, Jan, llama-server, etc.
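To confirm an endpoint is reachable before benchmarking, a quick probe along these lines may help. This is a minimal sketch, not part of BenchLoop: the helper names are hypothetical, /api/tags is Ollama's model-listing route, and /v1/models is the usual OpenAI-style route (it assumes the endpoint URL does not already end in /v1).

```python
import urllib.request

# Model-listing paths per provider flag. /api/tags is Ollama's route;
# /v1/models is the OpenAI-style route most compatible servers expose.
MODEL_LIST_PATHS = {"ollama": "/api/tags", "openai_compat": "/v1/models"}


def models_url(endpoint: str, provider: str) -> str:
    """Build the model-listing URL for a base endpoint (hypothetical helper)."""
    return endpoint.rstrip("/") + MODEL_LIST_PATHS[provider]


def endpoint_is_up(endpoint: str, provider: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers the model-listing request with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(endpoint, provider), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False
```

For example, endpoint_is_up("http://localhost:11434", "ollama") should return True once Ollama is serving on its default port.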

Then:

benchloop run \
  --model qwen3:8b \
  --endpoint http://localhost:11434 \
  --provider ollama

This runs every default suite, scores them, prints a console report, and persists the full run to ~/.bench-loop/runs/.

Run a subset

benchloop run --model qwen3:8b --suites speed,agent

Different prompting harness

Same model, four ways to talk to it:

benchloop run --model qwen3:8b --harness raw      # native tool calling
benchloop run --model qwen3:8b --harness hermes   # <tool_call>{...}</tool_call>
benchloop run --model qwen3:8b --harness qwen     # <function_call>{...}</function_call>
benchloop run --model qwen3:8b --harness pi       # <think>...</think> + Hermes tags
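To illustrate what a tag-based harness has to do with a model reply, here is a minimal sketch of extracting tool calls from the Hermes and Qwen tag formats noted in the comments above. The function name and the payload schema (a JSON object with "name" and "arguments") are assumptions for illustration, not BenchLoop's actual adapter code.

```python
import json
import re

# Tag names come from the harness comments above; everything else is illustrative.
TAGS = {"hermes": "tool_call", "qwen": "function_call"}


def extract_tool_calls(text: str, harness: str) -> list[dict]:
    """Return every JSON payload wrapped in the harness's tool-call tag."""
    tag = TAGS[harness]
    pattern = re.compile(rf"<{tag}>\s*(.*?)\s*</{tag}>", re.DOTALL)
    calls = []
    for raw in pattern.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than crash
    return calls


reply = 'Checking. <tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>'
print(extract_tool_calls(reply, "hermes"))
```

The same reply parsed with the "qwen" harness yields an empty list, which is one way a model can do well under one harness and fail under another.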

Stamp custom hardware (e.g. when benchmarking through a tunnel)

benchloop run \
  --model qwen3:8b \
  --endpoint http://localhost:11435 \
  --hardware "NVIDIA RTX 4090 24GB" \
  --gpu "NVIDIA RTX 4090" \
  --gpu-memory-gb 24

Launch the local dashboard

v0.2.0+ ships the full FastAPI + React dashboard inside the wheel. After pipx install benchloop-cli:

benchloop dashboard
# → open http://127.0.0.1:8877

This serves the Models, Benchmark, Leaderboard, Compare, and Chat tabs on a single port, with auto-discovered local providers (Ollama, LM Studio, MLX/Osaurus, vLLM, Jan).

For hot-reload development against a clone of bench-loop-web:

benchloop dashboard --dev

Suites

Suite           What it scores
speed           Latency, throughput, TTFT, generation tok/s across short/medium/long contexts
toolcall        Structured tool-call correctness across realistic tasks (weather, stocks, email, search)
coding          Executable Python tasks verified in a sandboxed subprocess (10s timeout)
dataextract     JSON / structured extraction from messy natural language
instructfollow  Constraint following, formatting, exactness
reasonmath      Small reasoning + math tasks with deterministic checks
agent           Multi-turn agentic tool use. BenchLoop drives a real loop: model emits a tool call, BenchLoop executes it locally, feeds the result back, model iterates until done. Scores correctness, efficiency, no-hallucination, required-tool coverage.
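The agent loop described in the last row can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not BenchLoop's orchestrator: the tool registry, the model_step callable, and the action format are all hypothetical, and a scripted stand-in replaces a real model.

```python
from typing import Callable

# Hypothetical local tool registry; BenchLoop's real tools are not shown in this README.
TOOLS: dict[str, Callable[..., str]] = {
    "get_weather": lambda city: f"18C and cloudy in {city}",
}


def run_agent_loop(model_step, max_turns: int = 6) -> dict:
    """Drive a model until it returns a final answer or runs out of turns.

    `model_step(history)` is an assumed callable that returns either
    {"tool": name, "args": {...}} or {"final": text}.
    """
    history: list[dict] = []
    called: list[str] = []
    for turn in range(1, max_turns + 1):
        action = model_step(history)
        if "final" in action:
            return {"final": action["final"], "turns": turn, "called": called}
        name = action["tool"]
        called.append(name)
        if name in TOOLS:
            result = TOOLS[name](**action["args"])  # execute the tool locally
        else:
            result = f"error: unknown tool {name}"  # a hallucinated tool
        history.append({"tool": name, "result": result})  # feed the result back
    return {"final": None, "turns": max_turns, "called": called}


def scripted_model(history):
    """Stand-in model: call one tool, then answer with its result."""
    if not history:
        return {"tool": "get_weather", "args": {"city": "Oslo"}}
    return {"final": history[-1]["result"]}


outcome = run_agent_loop(scripted_model)
print(outcome)
```

The returned record (final answer, turn count, tools called) is the kind of trace the four agent checks could be computed from.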

Scoring

Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability
  • Quality = mean of non-speed suite scores (size-fair).
  • Speed = 12.54 · log2(tok/s) + 0.9, clamped to 0–100.
  • Reliability = pass rate across all tasks.
  • Agent = correct_final + efficient + no_hallucinated_tools + all_required_called, 25 pts each, averaged across tasks.
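The arithmetic above can be written out in a few lines of Python. This is an illustration of the stated formulas only, with hypothetical function names; it is not BenchLoop's scorer.

```python
import math


def speed_score(tok_per_s: float) -> float:
    """Speed = 12.54 * log2(tok/s) + 0.9, clamped to 0-100."""
    if tok_per_s <= 0:
        return 0.0
    raw = 12.54 * math.log2(tok_per_s) + 0.9
    return max(0.0, min(100.0, raw))


def agent_task_score(correct_final: bool, efficient: bool,
                     no_hallucinated_tools: bool, all_required_called: bool) -> int:
    """Four checks worth 25 points each; the suite score averages this across tasks."""
    return 25 * sum([correct_final, efficient, no_hallucinated_tools, all_required_called])


def overall_score(quality: float, speed: float, reliability: float) -> float:
    """Overall = 0.55 * quality + 0.20 * speed + 0.25 * reliability."""
    return 0.55 * quality + 0.20 * speed + 0.25 * reliability


print(round(speed_score(74.6), 1))  # 78.9
```

The 74.6 tok/s figure is the one from the sample run at the top of this page, and the result matches the Speed row shown there.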

Local web app

A FastAPI backend + React frontend bundle ships alongside the CLI for visualizing runs:

benchloop dashboard   # starts the local web app (http://127.0.0.1:8877, as noted above)

Tabs: Models, Benchmark, Leaderboard, Compare runs, Chat, agent trace viewer.

Publish a run

Every completed benchmark auto-publishes to https://bench-loop.com/leaderboard via https://api.bench-loop.com/submit. Runs are deduped by (machine_id, run_id) so the same run from the same machine won't be double-counted.
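Deduplication on a (machine_id, run_id) key could look something like this. It is a sketch of the idea only; the record layout and function name are assumptions, and the real logic lives server-side.

```python
def dedupe_runs(runs: list[dict]) -> list[dict]:
    """Keep the first submission for each (machine_id, run_id) pair."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for run in runs:
        key = (run["machine_id"], run["run_id"])
        if key in seen:
            continue  # same run from the same machine: drop the repeat
        seen.add(key)
        unique.append(run)
    return unique


submissions = [
    {"machine_id": "m1", "run_id": "r1", "overall": 73.4},
    {"machine_id": "m1", "run_id": "r1", "overall": 73.4},  # resubmitted
    {"machine_id": "m2", "run_id": "r1", "overall": 70.0},  # different machine
]
print(len(dedupe_runs(submissions)))  # 2
```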

Opt out:

export BENCHLOOP_NO_SUBMIT=1

You can still manually export a snapshot for sharing / archiving:

benchloop export --output my-runs.json

Architecture

bench-loop/                    ← this repo, the CLI + suites + scorers
  bench_loop/
    cli.py                     ← `benchloop` entrypoint
    suites/                    ← speed, toolcall, coding, agent, ...
    harness.py                 ← raw / hermes / qwen / pi adapters
    providers/                 ← ollama, openai_compat
    runner/orchestrator.py     ← drives suites + harnesses
    tasks/                     ← frozen task YAML fixtures
bench-loop-web/                ← the web app (separate repo)
  api/                         ← FastAPI wrapper around bench_loop
  ui/                          ← local dashboard
  site/                        ← public bench-loop.com static site

Status

BenchLoop is in beta (the current release is 0.2.1). The benchmark surface, scoring, web app, agent loop, and four harnesses all work end-to-end. Still on the roadmap:

  • Streaming TTFT for OpenAI-compatible providers (TTFT is currently reported as 0 on those backends; Ollama TTFT works)
  • Bigger task fixtures (each suite is intentionally small and frozen for v1)
  • Hosted submission flow for community runs
  • More provider adapters (TGI, Bedrock, etc. if there's demand)

License

MIT. See LICENSE.

Download files

Source Distribution

benchloop_cli-0.2.1.tar.gz (650.0 kB)

Uploaded Source

Built Distribution

benchloop_cli-0.2.1-py3-none-any.whl (627.0 kB)

Uploaded Python 3

File details

Details for the file benchloop_cli-0.2.1.tar.gz.

File metadata

  • Download URL: benchloop_cli-0.2.1.tar.gz
  • Size: 650.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for benchloop_cli-0.2.1.tar.gz
Algorithm    Hash digest
SHA256       1e92e1c9f2248f9e98130ae2cd197b1ab339657592704f7b3b62dcaf895bab4c
MD5          9c4e8c7096a57c6ba8e717a94c55aecf
BLAKE2b-256  ac92a224ed729593f23f48d98b5333f7c0e69013be526dfd6f7fedd972c2684f

File details

Details for the file benchloop_cli-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: benchloop_cli-0.2.1-py3-none-any.whl
  • Size: 627.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for benchloop_cli-0.2.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       51f5ef6a4ac2d82874acf8b0e4dd588a7098e77a1d8d7d79e30776ace7bc1372
MD5          682b20050508b811d0ff12bb2e790709
BLAKE2b-256  4a20f97c730ea7a7eb03851207158db8963478cd756cd956fbfa20e6bc3d0a0c
