⚡ stressllm

Find the breaking point of your local LLM hardware.

stressllm is a CLI benchmarking tool that finds the "Performance Cliff" of your local setup. It progressively grows the context window and measures tokens-per-second, latency, VRAM usage, GPU temperature, and RAM pressure — then tells you exactly where your hardware gives up.

Quick Start

pip install stressllm

# Stress test a model via Ollama
stressllm run gemma2 --depth 3

# Check your hardware and dependencies
stressllm info

Prerequisites

Requirement            Required?       Notes
────────────────────   ─────────────   ───────────────────────────────────────
Python 3.9+            Yes
Ollama                 Yes (for run)   Must be running: ollama serve
NVIDIA GPU + drivers   Optional        Enables VRAM and temperature monitoring
llama-cpp-python       Optional        Only needed for the check command

stressllm checks for Ollama on startup and will tell you exactly what's missing if something isn't right.
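
If you want to reproduce that check by hand, a quick probe of the default endpoint looks like this (a sketch in Python using only the standard library, equivalent to the curl command in the FAQ below):

import json, urllib.request

try:
    # Ollama's default endpoint; /api/tags lists the locally pulled models.
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=2) as resp:
        names = [m["name"] for m in json.load(resp).get("models", [])]
    print("Ollama is up. Models:", ", ".join(names) or "(none pulled yet)")
except OSError:
    print("Ollama not reachable; start it with `ollama serve`.")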

Installation

# Basic install (Ollama stress testing)
pip install stressllm

# With GPU monitoring
pip install stressllm[gpu]

# With direct .gguf file analysis
pip install stressllm[gguf]

# Everything
pip install stressllm[all]

For development:

git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all]"

Usage

stressllm run — Stress test via Ollama

stressllm run gemma2 --depth 3

Progressively fills the context window (2k → 8k → 32k → ...) and measures performance at each step.

Option      Default   Description
─────────   ───────   ──────────────────────────────────────────────────────────
--depth     3         Context steps (1–5). Higher = larger contexts tested.
--timeout   300       Max seconds per context step. 0 = no limit.
--verbose   off       Show detected hardware and dependency info before the test.
--json      off       Output results as JSON for scripting and CI.

Example output:

╭─────────────────────────────────────────────────────╮
│  ⚡ stressllm — Stress Testing: gemma2              │
│  NVIDIA RTX 4090 · 24GB VRAM · 64GB RAM             │
╰─────────────────────────────────────────────────────╯

 Context   TPS     TTFT      VRAM     GPU Temp   RAM     Status
 ───────   ─────   ──────    ──────   ────────   ─────   ──────
 2k        45.2    120ms     34.2%    52°C       41%     ✅ Smooth
 8k        38.7    340ms     58.1%    61°C       43%     ✅ Smooth
 32k       12.1    1.4s      89.3%    74°C       52%     ⚠️  Slowing
 128k      2.3     8.2s      97.8%    82°C       68%     💀 Cliff

╭─────────────────────────────────────────────────────╮
│  Verdict: gemma2 runs well up to 8k context.        │
│  Performance cliff detected at 32k.                 │
╰─────────────────────────────────────────────────────╯

stressllm check — Direct .gguf analysis

stressllm check ./models/gemma-2b-q4.gguf --n-gpu -1

Loads a .gguf file directly into memory (no Ollama needed) and benchmarks it.

Option    Default   Description
───────   ───────   ─────────────────────────────────
--n-gpu   -1        GPU layers to offload (-1 = all).
--depth   3         Context steps (1–5).

Requires llama-cpp-python: pip install stressllm[gguf]
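
For context, this is roughly what loading a .gguf directly means in llama-cpp-python (a generic illustration, not stressllm's internals; it reuses the example path above):

from llama_cpp import Llama

# Load the file straight into memory, offloading all layers to the GPU (-1)
# and allocating an 8k context window for this step.
llm = Llama(
    model_path="./models/gemma-2b-q4.gguf",
    n_gpu_layers=-1,
    n_ctx=8192,
    verbose=False,
)

# Generate a handful of tokens to exercise the KV cache.
out = llm("Describe your hardware limits.", max_tokens=32)
print(out["choices"][0]["text"])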

stressllm info — Hardware & dependency check

stressllm info

Shows detected GPU, RAM, CPU cores, dependency status (Ollama, pynvml, llama-cpp-python), depth level reference, and status legend. Useful for debugging and issue reports.

stressllm models — List available models

stressllm models

Lists all models pulled in Ollama with their size and a ready-to-copy run command for each one.

Known Limitations

  • TPS measures generation speed. The model generates 32 tokens at each context step to measure real-world output speed. TTFT (time to first token) measures how fast the model processes your input context. A small worked example follows this list.
  • High depths are slow. Depth 4 (128k) and depth 5 (512k) can take several minutes per step. Start with --depth 1 or --depth 2 to verify things work before going deeper. Each step has a default timeout of 5 minutes — use --timeout 120 to shorten it or --timeout 0 for no limit.
  • Ctrl+C works during tests. If a step is taking too long, press Ctrl+C to stop and see partial results for steps already completed.
  • GPU metrics are NVIDIA-only. AMD and Apple Silicon GPUs won't report VRAM or temperature. The tool still works in CPU-only mode with RAM and CPU% metrics.
  • Model names must be exact. Use the full name including the tag — gemma:2b, not gemma. Run stressllm models to see exact names available on your machine.
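
Worked example of how the two metrics relate (illustrative numbers, not output from a real run; stressllm's exact timing conventions may differ):

# Timing a single context step (hypothetical numbers).
prompt_sent_at = 0.0      # request sent
first_token_at = 1.4      # first output token arrives  ->  TTFT = 1.4 s
last_token_at = 4.0       # 32nd output token arrives
tokens_generated = 32

ttft = first_token_at - prompt_sent_at                       # prompt processing time
tps = tokens_generated / (last_token_at - first_token_at)    # 32 / 2.6 ≈ 12.3 tok/s
print(f"TTFT: {ttft:.1f}s  TPS: {tps:.1f}")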

How It Works

stressllm forces the model to allocate progressively larger KV caches by setting num_ctx on each Ollama request. It generates prompts from a pool of 1000 common English words (each word ≈ 1 token) to accurately fill the context window:

Depth   Context steps tested
─────   ───────────────────────────
1       2k
2       2k → 8k
3       2k → 8k → 32k
4       2k → 8k → 32k → 128k
5       2k → 8k → 32k → 128k → 512k

At each step, it measures tokens-per-second (TPS), time-to-first-token (TTFT), and hardware telemetry. The "Performance Cliff" is the context size where TPS drops below usable thresholds:

  • TPS > 15 → ✅ Smooth
  • TPS 5–15 → ⚠️ Slowing
  • TPS < 5 → 💀 Cliff
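
A minimal sketch of that mechanism, assuming Ollama's standard /api/generate streaming API (this is not stressllm's actual code; the single filler word, 90% fill factor, and one-token-per-chunk counting are simplifications of the word-pool approach described above):

import json, time, urllib.request

def stress_step(model: str, num_ctx: int) -> dict:
    # Fill most of the window with filler words (~1 token each), leaving
    # headroom for the 32 generated tokens.
    prompt = " ".join(["context"] * int(num_ctx * 0.9))
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": True,
        "options": {"num_ctx": num_ctx, "num_predict": 32},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    first, tokens = None, 0
    with urllib.request.urlopen(req) as resp:
        for line in resp:                     # Ollama streams one JSON object per line
            chunk = json.loads(line)
            if chunk.get("response"):         # roughly one token per streamed chunk
                first = first or time.time()
                tokens += 1
            if chunk.get("done"):
                break
    end = time.time()
    ttft = (first or end) - start
    tps = tokens / max(end - first, 1e-9) if first else 0.0
    status = "Smooth" if tps > 15 else "Slowing" if tps >= 5 else "Cliff"
    return {"num_ctx": num_ctx, "ttft_s": round(ttft, 2),
            "tps": round(tps, 1), "status": status}

for ctx in (2048, 8192, 32768):               # depth 3: 2k -> 8k -> 32k
    print(stress_step("gemma2", ctx))

Hardware telemetry (VRAM, GPU temperature, RAM) is collected separately in the real tool; this sketch only covers the timing side.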

FAQ

What if I don't have a GPU? stressllm works fine in CPU-only mode. GPU columns are replaced with CPU% and the verdict adapts accordingly.

What models work? Any model available in Ollama. Run ollama list to see what you have pulled.

How accurate is this? The synthetic prompts stress the KV cache but don't perfectly replicate real workloads. Use the results as a ceiling — real-world performance may vary based on prompt complexity.

I get different results on back-to-back runs? Normal. Results can vary ±20% between runs due to thermal throttling, background system load, Ollama's KV cache state, and VRAM fragmentation. If a context size flips between "Slowing" and "Cliff" across runs, that's your borderline — treat it as the edge of what your hardware can handle.

Ollama isn't detected but it's running? Make sure it's serving on the default port: http://localhost:11434. Check with curl http://localhost:11434/api/tags.

Contributing

See CONTRIBUTING.md for the full guide. Quick version:

git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all,dev]"

# Verify
stressllm info

# Run tests and checks
pytest
ruff check src/
bandit -r src/

Issues and PRs welcome. Please keep the code simple — this is a CLI tool, not a framework.

License

MIT
