⚡ stressllm
Find the breaking point of your local LLM hardware.
stressllm is a CLI benchmarking tool that finds the "Performance Cliff" of your local setup. It progressively grows the context window and measures tokens-per-second, latency, VRAM usage, GPU temperature, and RAM pressure — then tells you exactly where your hardware gives up.
Quick Start
pip install stressllm
# Stress test a model via Ollama
stressllm run gemma2 --depth 3
# Check your hardware and dependencies
stressllm info
Prerequisites
| Requirement | Required? | Notes |
|---|---|---|
| Python 3.9+ | Yes | |
| Ollama | Yes (for `run`) | Must be running: `ollama serve` |
| NVIDIA GPU + drivers | Optional | Enables VRAM and temperature monitoring |
| llama-cpp-python | Optional | Only needed for the `check` command |
stressllm checks for Ollama on startup and will tell you exactly what's missing if something isn't right.
Installation
# Basic install (Ollama stress testing)
pip install stressllm
# With GPU monitoring
pip install stressllm[gpu]
# With direct .gguf file analysis
pip install stressllm[gguf]
# Everything
pip install stressllm[all]
For development:
git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all]"
Usage
stressllm run — Stress test via Ollama
stressllm run gemma2 --depth 3
Progressively fills the context window (2k → 8k → 32k → ...) and measures performance at each step.
| Option | Default | Description |
|---|---|---|
| `--depth` | 3 | Context steps (1–5). Higher = larger contexts tested. |
| `--timeout` | 300 | Max seconds per context step. 0 = no limit. |
| `--verbose` | off | Show detected hardware and dependency info before the test. |
| `--json` | off | Output results as JSON for scripting and CI. |
Example output:
╭─────────────────────────────────────────────────────╮
│ ⚡ stressllm — Stress Testing: gemma2 │
│ NVIDIA RTX 4090 · 24GB VRAM · 64GB RAM │
╰─────────────────────────────────────────────────────╯
Context TPS TTFT VRAM GPU Temp RAM Status
─────── ───── ────── ────── ──────── ───── ──────
2k 45.2 120ms 34.2% 52°C 41% ✅ Smooth
8k 38.7 340ms 58.1% 61°C 43% ✅ Smooth
32k 12.1 1.4s 89.3% 74°C 52% ⚠️ Slowing
128k 2.3 8.2s 97.8% 82°C 68% 💀 Cliff
╭─────────────────────────────────────────────────────╮
│ Verdict: gemma2 runs well up to 8k context. │
│ Performance cliff detected at 32k. │
╰─────────────────────────────────────────────────────╯
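The `--json` flag makes a run scriptable, for example as a regression gate in CI. Here is a minimal sketch of consuming that output, assuming the JSON contains a per-step list with context size and TPS; the field names below (`steps`, `context`, `tps`) are illustrative assumptions, not the published schema, so check the actual output on your machine first:

```python
import json
import subprocess
import sys

# Run the benchmark and capture machine-readable results.
# NOTE: the field names ("steps", "context", "tps") are assumptions for
# illustration; inspect the real --json output before relying on them.
result = subprocess.run(
    ["stressllm", "run", "gemma2", "--depth", "3", "--json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

# Fail the job if the model can no longer sustain 15 TPS at 8k context.
for step in report.get("steps", []):
    if step["context"] <= 8192 and step["tps"] < 15:
        sys.exit(f"Cliff too early: {step['tps']:.1f} TPS at {step['context']} ctx")
print("8k context still smooth")
```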
stressllm check — Direct .gguf analysis
stressllm check ./models/gemma-2b-q4.gguf --n-gpu -1
Loads a .gguf file directly into memory (no Ollama needed) and benchmarks it.
| Option | Default | Description |
|---|---|---|
| `--n-gpu` | -1 | GPU layers to offload (-1 = all). |
| `--depth` | 3 | Context steps (1–5). |
Requires llama-cpp-python: pip install stressllm[gguf]
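For reference, a direct `.gguf` load with llama-cpp-python looks roughly like the sketch below. This is not the tool's internal code, just what the library call involves; the model path is a placeholder:

```python
from llama_cpp import Llama

# Load the quantized model straight from disk, offloading all layers to the GPU
# (n_gpu_layers=-1 mirrors stressllm's --n-gpu -1 default).
llm = Llama(
    model_path="./models/gemma-2b-q4.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=8192,   # one of the context sizes a depth-3 run would exercise
    verbose=False,
)

# Generate a short completion to exercise prompt processing and decoding.
out = llm("Summarize the history of benchmarking in one sentence.", max_tokens=32)
print(out["choices"][0]["text"])
```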
stressllm info — Hardware & dependency check
stressllm info
Shows detected GPU, RAM, CPU cores, dependency status (Ollama, pynvml, llama-cpp-python), depth level reference, and status legend. Useful for debugging and issue reports.
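The NVIDIA metrics come from the NVML bindings (pynvml). If you want to read the same counters yourself, a minimal sketch assuming a single GPU at index 0:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# VRAM usage as a percentage, matching the VRAM column in the run table.
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
vram_pct = 100 * mem.used / mem.total

# Core temperature in degrees Celsius.
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

print(f"VRAM: {vram_pct:.1f}%  Temp: {temp}°C")
pynvml.nvmlShutdown()
```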
stressllm models — List available models
stressllm models
Lists all models pulled in Ollama with their size and a ready-to-copy run command for each one.
Known Limitations
- TPS measures generation speed. The model generates 32 tokens at each context step to measure real-world output speed. TTFT (time to first token) measures how fast the model processes your input context.
- High depths are slow. Depth 4 (128k) and depth 5 (512k) can take several minutes per step. Start with `--depth 1` or `--depth 2` to verify things work before going deeper. Each step has a default timeout of 5 minutes — use `--timeout 120` to shorten it or `--timeout 0` for no limit.
- Ctrl+C works during tests. If a step is taking too long, press Ctrl+C to stop and see partial results for steps already completed.
- GPU metrics are NVIDIA-only. AMD and Apple Silicon GPUs won't report VRAM or temperature. The tool still works in CPU-only mode with RAM and CPU% metrics.
- Model names must be exact. Use the full name including the tag — `gemma:2b`, not `gemma`. Run `stressllm models` to see exact names available on your machine.
How It Works
stressllm forces the model to allocate progressively larger KV caches by setting num_ctx on each Ollama request. It generates prompts from a pool of 1000 common English words (each word ≈ 1 token) to accurately fill the context window:
| Depth | Context Steps Tested |
|---|---|
| 1 | 2k |
| 2 | 2k → 8k |
| 3 | 2k → 8k → 32k |
| 4 | 2k → 8k → 32k → 128k |
| 5 | 2k → 8k → 32k → 128k → 512k |
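In other words, each step boils down to one Ollama request with a padded prompt and an explicit `num_ctx`. A minimal sketch of that idea against the standard `/api/generate` endpoint; the tiny word pool and the helper function are simplified stand-ins for illustration, not the tool's actual code:

```python
import random
import time
import requests

WORDS = ["time", "people", "way", "water", "work"]  # stand-in for the 1000-word pool

def stress_step(model: str, num_ctx: int) -> dict:
    # Fill roughly 80% of the target window; one common word ≈ one token.
    prompt = " ".join(random.choices(WORDS, k=int(num_ctx * 0.8)))

    start = time.monotonic()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_ctx": num_ctx, "num_predict": 32},
        },
        timeout=300,
    ).json()

    # Ollama reports decode stats in nanoseconds.
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    prompt_s = resp["prompt_eval_duration"] / 1e9  # time spent processing the input
    return {"context": num_ctx, "tps": tps, "prompt_s": prompt_s,
            "wall_s": time.monotonic() - start}

print(stress_step("gemma2", 8192))
```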
At each step, it measures tokens-per-second (TPS), time-to-first-token (TTFT), and hardware telemetry. The "Performance Cliff" is the context size where TPS drops below usable thresholds:
- TPS > 15 → ✅ Smooth
- TPS 5–15 → ⚠️ Slowing
- TPS < 5 → 💀 Cliff
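Those thresholds map directly to the status column in the results table. A tiny sketch of the classification, assuming the same cutoffs:

```python
def classify(tps: float) -> str:
    """Map tokens-per-second to the status shown in the results table."""
    if tps > 15:
        return "✅ Smooth"
    if tps >= 5:
        return "⚠️ Slowing"
    return "💀 Cliff"

assert classify(45.2) == "✅ Smooth"
assert classify(12.1) == "⚠️ Slowing"
assert classify(2.3) == "💀 Cliff"
```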
FAQ
What if I don't have a GPU? stressllm works fine in CPU-only mode. GPU columns are replaced with CPU% and the verdict adapts accordingly.
What models work?
Any model available in Ollama. Run ollama list to see what you have pulled.
How accurate is this? The synthetic prompts stress the KV cache but don't perfectly replicate real workloads. Use the results as a ceiling — real-world performance may vary based on prompt complexity.
Why do I get different results on back-to-back runs? Normal. Results can vary ±20% between runs due to thermal throttling, background system load, Ollama's KV cache state, and VRAM fragmentation. If a context size flips between "Slowing" and "Cliff" across runs, that's your borderline — treat it as the edge of what your hardware can handle.
Ollama isn't detected even though it's running?
Make sure it's serving on the default port: http://localhost:11434. Check with curl http://localhost:11434/api/tags.
Contributing
See CONTRIBUTING.md for the full guide. Quick version:
git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all,dev]"
# Verify
stressllm info
# Run tests and checks
pytest
ruff check src/
bandit -r src/
Issues and PRs welcome. Please keep the code simple — this is a CLI tool, not a framework.
License
MIT