Terminal benchmark runner with automatic discovery of local LLM servers
llm-speed
Benchmark local LLM runtimes without confusing runtime speed with model differences.
Highlights
- Finds local Ollama and OpenAI-compatible servers without configuration.
- Measures TTFT, total latency, prompt throughput, generation throughput, RAM, CPU, and macOS energy.
- Keeps cross-engine comparisons honest with pinned weights, quantization, revisions, and checksums.
- Installs, launches, tunes, and stops 15 supported runtimes one at a time.
- Stores every tuning run in SQLite and flags performance regressions.
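As a rough illustration of how the timing metrics above relate to each other, here is a minimal sketch that derives them from three timestamps and two token counts. The `StreamTiming` structure, the `summarize` function, and the exact definitions (for example, excluding the first token from the decode window) are assumptions for this example, not necessarily what llm-speed computes internally.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Hypothetical record of one streamed request; times are monotonic seconds."""
    request_sent: float
    first_token: float
    last_token: float
    prompt_tokens: int
    generated_tokens: int


def summarize(t: StreamTiming) -> dict[str, float]:
    """Derive latency and throughput figures from raw timestamps."""
    ttft = t.first_token - t.request_sent          # time to first token
    total = t.last_token - t.request_sent          # total latency
    decode_window = t.last_token - t.first_token   # time spent after the first token

    # Prompt throughput: prompt tokens processed before the first token appears.
    prompt_tps = t.prompt_tokens / ttft if ttft > 0 else 0.0

    # Generation throughput: tokens after the first one, over the decode window.
    if decode_window > 0 and t.generated_tokens > 1:
        gen_tps = (t.generated_tokens - 1) / decode_window
    else:
        gen_tps = 0.0

    return {
        "ttft_s": ttft,
        "total_latency_s": total,
        "prompt_tokens_per_s": prompt_tps,
        "generation_tokens_per_s": gen_tps,
    }


if __name__ == "__main__":
    print(summarize(StreamTiming(0.0, 0.25, 2.25, 100, 101)))
```

With these inputs the sketch reports a TTFT of 0.25 s, 400 prompt tokens/s, and 50 generated tokens/s. Other tools fold TTFT into generation throughput, which is one reason numbers from different benchmarks are hard to compare directly.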
Demo
llm-speed discover
llm-speed run --all-models --max-tokens 256 --output reports/run.json
Overview
Local inference speed is not one number. It is a function of the model, exact weights, quantization,
runtime, launch flags, context length, and hardware state. llm-speed controls those variables,
runs streamed benchmarks, and records enough evidence to make the result useful later. It is built
for people comparing local LLM servers on their own machines.
Motivation
A benchmark is easy to write and easy to get wrong. Comparing an Ollama tag against a different MLX
snapshot mostly measures different artifacts, not different engines. Counting characters as tokens
can move the winner again. llm-speed treats model identity and token provenance as part of the
benchmark, then refuses strict ranking when those facts are missing.
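The idea of refusing a strict ranking can be sketched as a guard that runs before any sorting. Everything below (`RunRecord`, its fields, the `token_source` values, and `strict_rank`) is hypothetical and only illustrates the principle; the project's real data model may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical summary of one benchmark run."""
    engine: str
    weights_sha256: Optional[str]   # checksum of the weights the engine loaded
    quantization: Optional[str]
    token_source: str               # "server", "tokenizer", or "estimated"
    generation_tps: float


def strict_rank(runs: list[RunRecord]) -> list[RunRecord]:
    """Rank runs by throughput, but only if they are provably comparable."""
    reasons = []
    if any(r.weights_sha256 is None or r.quantization is None for r in runs):
        reasons.append("model identity is missing for at least one run")
    if len({(r.weights_sha256, r.quantization) for r in runs}) > 1:
        reasons.append("runs used different artifacts")
    if any(r.token_source == "estimated" for r in runs):
        reasons.append("token counts were estimated, not measured")
    if reasons:
        raise ValueError("strict ranking refused: " + "; ".join(reasons))
    return sorted(runs, key=lambda r: r.generation_tps, reverse=True)


if __name__ == "__main__":
    same = [
        RunRecord("engine-a", "abc", "4bit", "server", 41.0),
        RunRecord("engine-b", "abc", "4bit", "tokenizer", 47.5),
    ]
    print([r.engine for r in strict_rank(same)])
```

Raising an error rather than returning a best-effort ordering keeps an unverifiable comparison from ever looking like a result.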
Features
- Concurrent discovery across common localhost ports and custom URLs.
- Ollama and OpenAI-compatible streaming drivers.
- Generation, concurrency, long-context, prompt-processing, deterministic quality, MMLU, GPQA, and HumanEval benchmarks.
- Exact server counters with an isolated tokenizer fallback for pinned local models.
- Managed profiles for Ollama, LM Studio, llama.cpp, LocalAI, vLLM, text-generation-webui, and nine MLX runtimes.
- Verified GGUF downloads and immutable Hugging Face snapshots.
- Declarative configuration matrices with up to 256 variants per profile.
- JSON, CSV, HTML, SQLite history, comparisons, and regression detection.
- CPU, RAM, thermal-state, and optional CPU/GPU/ANE energy collection.
- Benchmark and launch-profile plugins through Python entry points.
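To make the configuration-matrix feature concrete, the sketch below expands a declarative matrix into individual variants and enforces the 256-variant cap mentioned above. The option names (`context_length`, `batch_size`) and the `expand_matrix` function are made up for illustration; the actual profile schema is documented in the project, not here.

```python
from itertools import product
from typing import Any

MAX_VARIANTS = 256  # cap taken from the feature list above


def expand_matrix(matrix: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Expand {option: [values...]} into one dict per combination."""
    keys = sorted(matrix)  # fixed key order keeps variant numbering reproducible
    count = 1
    for key in keys:
        count *= len(matrix[key])
    if count > MAX_VARIANTS:
        raise ValueError(f"{count} variants exceed the limit of {MAX_VARIANTS}")
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


if __name__ == "__main__":
    # Hypothetical launch options, not real llm-speed profile keys.
    variants = expand_matrix({
        "context_length": [2048, 8192],
        "batch_size": [1, 4, 8],
    })
    for v in variants:
        print(v)
```

Checking the count before building the list means an oversized matrix fails quickly instead of launching hundreds of engine processes first.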
Architecture
Components:
- discovery probes local endpoints and identifies their protocol.
- drivers normalize streaming responses and token accounting.
- model_matrix and model_acquisition prove which artifact each engine receives.
- tuning installs engines, isolates processes, expands variants, monitors resources, and ranks runs.
- benchmarks, reporting, and history turn measurements into reusable evidence.
Flow: discover or launch -> warm up -> stream -> measure -> validate -> rank -> persist.
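The first stage of this flow, discovery, might look roughly like the sketch below: probe a few localhost ports concurrently and classify each responder by which model-listing endpoint answers. The port list and all function names are assumptions for this example. The two paths are the public model-listing routes of Ollama (`/api/tags`) and OpenAI-compatible servers (`/v1/models`); how llm-speed itself identifies protocols is not shown here.

```python
import http.client
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# Illustrative ports only: 11434 is Ollama's default; the rest are common defaults.
CANDIDATE_PORTS = (11434, 1234, 8000, 8080)

# (protocol label, path to probe, top-level key expected in the JSON reply)
PROBES = (
    ("ollama", "/api/tags", "models"),
    ("openai", "/v1/models", "data"),
)

Fetch = Callable[[str], Optional[dict]]


def http_get_json(url: str, timeout: float = 0.5) -> Optional[dict]:
    """Return the parsed JSON object at url, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        return None
    return body if isinstance(body, dict) else None


def identify(base_url: str, fetch: Fetch = http_get_json) -> Optional[str]:
    """Try each probe in order; the first matching reply names the protocol."""
    for protocol, path, key in PROBES:
        body = fetch(base_url + path)
        if body is not None and key in body:
            return protocol
    return None


def discover(ports=CANDIDATE_PORTS, fetch: Fetch = http_get_json) -> dict[str, str]:
    """Probe all ports in parallel and map responding base URLs to protocols."""
    urls = [f"http://127.0.0.1:{port}" for port in ports]
    with ThreadPoolExecutor(max_workers=len(urls) or 1) as pool:
        protocols = list(pool.map(lambda u: identify(u, fetch), urls))
    return {u: p for u, p in zip(urls, protocols) if p is not None}


if __name__ == "__main__":
    print(discover())
```

The Ollama probe runs first because Ollama also exposes an OpenAI-compatible route, so order decides how such a server is labeled. Passing `fetch` as a parameter lets tests substitute a fake without opening sockets.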
Tech Stack
- Python 3.11+ with a dependency-free runtime.
- argparse, urllib, sqlite3, and subprocess from the standard library.
- Hatchling for wheel and sdist builds.
- Ruff, strict Mypy, and unittest-compatible tests.
- GitHub Actions on Python 3.11, 3.12, and 3.13.
Quick Start
Install the PyPI distribution; the command remains llm-speed:
pip install local-llm-speed
llm-speed discover
See docs/setup.md for authentication, managed engines, quality suites, and macOS energy measurement.
Usage
Benchmark servers that are already running:
llm-speed run --all-models --max-tokens 256 --output reports/run.json
Run a strict comparison on one pinned MLX snapshot:
llm-speed tune \
--model qwen3-0.6b-mlx-4bit \
--model-matrix model-matrix.qwen3-0.6b-mlx.json \
--engine mlx-lm \
--engine vllm-mlx \
--full-suite
Inspect stored results:
llm-speed history
llm-speed compare 12 13
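For readers curious what comparing two stored runs involves, here is a minimal sketch using the standard-library sqlite3 module. The table layout, the function names, and the 5% threshold are assumptions for illustration and do not describe llm-speed's actual schema or regression rule.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a tiny run-history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "id INTEGER PRIMARY KEY, engine TEXT NOT NULL, "
        "model TEXT NOT NULL, generation_tps REAL NOT NULL)"
    )
    return conn


def record_run(conn: sqlite3.Connection, engine: str, model: str, tps: float) -> int:
    """Store one run and return its numeric id."""
    cur = conn.execute(
        "INSERT INTO runs (engine, model, generation_tps) VALUES (?, ?, ?)",
        (engine, model, tps),
    )
    conn.commit()
    return cur.lastrowid


def compare(conn: sqlite3.Connection, baseline_id: int, candidate_id: int,
            threshold: float = 0.05) -> dict:
    """Report relative throughput change and flag drops beyond the threshold."""
    rows = dict(conn.execute(
        "SELECT id, generation_tps FROM runs WHERE id IN (?, ?)",
        (baseline_id, candidate_id),
    ))
    base, cand = rows[baseline_id], rows[candidate_id]
    change = (cand - base) / base
    return {"change": change, "regression": change < -threshold}


if __name__ == "__main__":
    db = open_history()
    a = record_run(db, "engine-a", "example-model", 100.0)
    b = record_run(db, "engine-a", "example-model", 90.0)
    print(compare(db, a, b))
```

A relative threshold is used here so the same rule works for both fast and slow models; a real tool would likely also account for run-to-run variance.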
Project Structure
src/llm_speed/
  benchmarks/          benchmark contracts and built-in suites
  drivers/             Ollama and OpenAI-compatible protocols
  tuning/              installation, lifecycle, monitoring, and ranking
  cli.py               command-line entry point
  model_matrix.py      strict cross-engine artifact identity
tests/                 unit and lifecycle tests with fake local servers
model-matrix.*.json    reproducible MLX and GGUF examples
profiles.example.json
Status
Stage: Experimental (0.2.x, PyPI classifier: Alpha).
Planned:
- Future work is tracked through GitHub issues and benchmark evidence.
Testing
python -m unittest discover -s tests -v
ruff check .
mypy src
python -m build
Contributing
See CONTRIBUTING.md.
License: MIT (see LICENSE).
If you like this project, please give it a star ⭐
File details
Details for the file local_llm_speed-0.2.1.tar.gz.
File metadata
- Download URL: local_llm_speed-0.2.1.tar.gz
- Upload date:
- Size: 58.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d5a0eaec5ed3d24b5a1c9d009d7fee9fc32c0eacaaa3216ffbff9c42ca94039c |
| MD5 | 90952cb46c0461d3f20645b830c8cbe5 |
| BLAKE2b-256 | 781e53c9bd8a962689b465d609554d55220cdbae7c9f7de559f5cd89ebb7806d |
Provenance
The following attestation bundles were made for local_llm_speed-0.2.1.tar.gz:
Publisher: release.yml on KazKozDev/llm-speed
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: local_llm_speed-0.2.1.tar.gz
  - Subject digest: d5a0eaec5ed3d24b5a1c9d009d7fee9fc32c0eacaaa3216ffbff9c42ca94039c
- Sigstore transparency entry: 1927778409
- Sigstore integration time:
- Permalink: KazKozDev/llm-speed@bcb2786b27a5267e03670e2717534e71f177640b
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/KazKozDev
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bcb2786b27a5267e03670e2717534e71f177640b
- Trigger Event: push
File details
Details for the file local_llm_speed-0.2.1-py3-none-any.whl.
File metadata
- Download URL: local_llm_speed-0.2.1-py3-none-any.whl
- Upload date:
- Size: 58.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e54f1ed79abde019b44f8f2b72cbc9415e1f58ca4261addfa45ebc94731b2a95 |
| MD5 | 04d6a32694b5341c58f5745668e53360 |
| BLAKE2b-256 | 3c799ceb6e6f1a84402c47f2d78a962a8789991643658f561d2e3b1d2dc521ae |
Provenance
The following attestation bundles were made for local_llm_speed-0.2.1-py3-none-any.whl:
Publisher: release.yml on KazKozDev/llm-speed
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: local_llm_speed-0.2.1-py3-none-any.whl
  - Subject digest: e54f1ed79abde019b44f8f2b72cbc9415e1f58ca4261addfa45ebc94731b2a95
- Sigstore transparency entry: 1927778772
- Sigstore integration time:
- Permalink: KazKozDev/llm-speed@bcb2786b27a5267e03670e2717534e71f177640b
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/KazKozDev
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bcb2786b27a5267e03670e2717534e71f177640b
- Trigger Event: push