
⚡ stressllm

Find the breaking point of your local LLM hardware.

stressllm is a CLI benchmarking tool that finds the "Performance Cliff" of your local setup. It progressively grows the context window and measures tokens-per-second, latency, VRAM usage, GPU temperature, and RAM pressure — then tells you exactly where your hardware gives up.

Quick Start

pip install stressllm

# Stress test a model via Ollama
stressllm run gemma2 --depth 3

# Check your hardware and dependencies
stressllm info

Prerequisites

 Requirement            Required?       Notes
 ────────────────────   ─────────────   ───────────────────────────────────────
 Python 3.9+            Yes             —
 Ollama                 Yes (for run)   Must be running: ollama serve
 NVIDIA GPU + drivers   Optional        Enables VRAM and temperature monitoring
 llama-cpp-python       Optional        Only needed for the check command

stressllm checks for Ollama on startup and will tell you exactly what's missing if something isn't right.

Installation

# Basic install (Ollama stress testing)
pip install stressllm

# With GPU monitoring
pip install stressllm[gpu]

# With direct .gguf file analysis
pip install stressllm[gguf]

# Everything
pip install stressllm[all]

For development:

git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all]"

Usage

stressllm run — Stress test via Ollama

stressllm run gemma2 --depth 3

Progressively fills the context window (2k → 8k → 32k → ...) and measures performance at each step.

 Option      Default   Description
 ─────────   ───────   ────────────────────────────────────────────────────────
 --depth     3         Context steps (1–5). Higher = larger contexts tested.
 --timeout   300       Max seconds per context step. 0 = no limit.
 --verbose   off       Show detected hardware and dependency info before the test.
 --json      off       Output results as JSON for scripting and CI.
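
The --json flag makes run output scriptable, e.g. for failing a CI job when a cliff appears. A sketch of how a pipeline step might consume it — note the JSON schema below is an illustrative assumption, not the tool's documented format:

```python
import json

# Hypothetical --json output shape (illustrative only; the real
# schema may differ). Values mirror the example output above.
sample = json.loads("""
{
  "model": "gemma2",
  "steps": [
    {"context": "2k",   "tps": 45.2, "status": "Smooth"},
    {"context": "8k",   "tps": 38.7, "status": "Smooth"},
    {"context": "32k",  "tps": 12.1, "status": "Slowing"},
    {"context": "128k", "tps": 2.3,  "status": "Cliff"}
  ]
}
""")

def first_cliff(report):
    """Return the first context size whose status is 'Cliff', or None."""
    for step in report["steps"]:
        if step["status"] == "Cliff":
            return step["context"]
    return None

print(first_cliff(sample))  # a CI script could fail when this is not None
```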

Example output:

╭─────────────────────────────────────────────────────╮
│  ⚡ stressllm — Stress Testing: gemma2              │
│  NVIDIA RTX 4090 · 24GB VRAM · 64GB RAM             │
╰─────────────────────────────────────────────────────╯

 Context   TPS     TTFT      VRAM     GPU Temp   RAM     Status
 ───────   ─────   ──────    ──────   ────────   ─────   ──────
 2k        45.2    120ms     34.2%    52°C       41%     ✅ Smooth
 8k        38.7    340ms     58.1%    61°C       43%     ✅ Smooth
 32k       12.1    1.4s      89.3%    74°C       52%     ⚠️  Slowing
 128k      2.3     8.2s      97.8%    82°C       68%     💀 Cliff

╭─────────────────────────────────────────────────────╮
│  Verdict: gemma2 runs well up to 8k context.        │
│  Performance cliff detected at 32k.                 │
╰─────────────────────────────────────────────────────╯

stressllm check — Direct .gguf analysis

stressllm check ./models/gemma-2b-q4.gguf --n-gpu -1

Loads a .gguf file directly into memory (no Ollama needed) and benchmarks it.

 Option    Default   Description
 ───────   ───────   ──────────────────────────────────
 --n-gpu   -1        GPU layers to offload (-1 = all).
 --depth   3         Context steps (1–5).

Requires llama-cpp-python: pip install stressllm[gguf]

stressllm info — Hardware & dependency check

stressllm info

Shows detected GPU, RAM, CPU cores, dependency status (Ollama, pynvml, llama-cpp-python), depth level reference, and status legend. Useful for debugging and issue reports.

stressllm models — List available models

stressllm models

Lists all models pulled in Ollama with their size and a ready-to-copy run command for each one.

Known Limitations

  • TPS measures generation speed. The model generates 32 tokens at each context step to measure real-world output speed. TTFT (time to first token) measures how fast the model processes your input context.
  • High depths are slow. Depth 4 (128k) and depth 5 (512k) can take several minutes per step. Start with --depth 1 or --depth 2 to verify things work before going deeper. Each step has a default timeout of 5 minutes — use --timeout 120 to shorten it or --timeout 0 for no limit.
  • Ctrl+C works during tests. If a step is taking too long, press Ctrl+C to stop and see partial results for steps already completed.
  • GPU metrics are NVIDIA-only. AMD and Apple Silicon GPUs won't report VRAM or temperature. The tool still works in CPU-only mode with RAM and CPU% metrics.
  • Model names must be exact. Use the full name including the tag — gemma:2b, not gemma. Run stressllm models to see exact names available on your machine.
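
The TPS and TTFT metrics in the first point reduce to simple arithmetic over three timestamps; a minimal sketch (variable names are illustrative, not stressllm internals):

```python
def compute_metrics(t_request, t_first_token, t_done, n_tokens=32):
    """TTFT: how long the model took to process the input context.
    TPS: generated tokens divided by pure generation time."""
    ttft = t_first_token - t_request
    tps = n_tokens / (t_done - t_first_token)
    return ttft, tps

# Example: first token arrives 0.34 s after the request,
# and the 32 generated tokens finish 0.827 s after that.
ttft, tps = compute_metrics(0.0, 0.34, 1.167)
print(f"TTFT={ttft:.2f}s TPS={tps:.1f}")  # matches the 8k row above
```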

How It Works

stressllm forces the model to allocate progressively larger KV caches by setting num_ctx on each Ollama request. It generates prompts from a pool of 1000 common English words (each word ≈ 1 token) to accurately fill the context window:

 Depth   Context Steps Tested
 ─────   ───────────────────────────────
 1       2k
 2       2k → 8k
 3       2k → 8k → 32k
 4       2k → 8k → 32k → 128k
 5       2k → 8k → 32k → 128k → 512k
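
The filler-prompt idea above can be sketched in a few lines, assuming roughly one token per common English word (the real tool draws from a 1000-word pool; the tiny pool here is a stand-in):

```python
import itertools

# Stand-in for the real 1000-word pool of common English words.
WORD_POOL = ["time", "people", "way", "water", "word", "place", "work", "thing"]

def build_filler_prompt(target_tokens):
    """Cycle through the pool until the prompt is ~target_tokens words long."""
    words = itertools.islice(itertools.cycle(WORD_POOL), target_tokens)
    return " ".join(words)

# Roughly fills a 2k context window; the tool would send this to
# Ollama's /api/generate with options.num_ctx set to the step size.
prompt = build_filler_prompt(2048)
print(len(prompt.split()))  # 2048
```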

At each step, it measures tokens-per-second (TPS), time-to-first-token (TTFT), and hardware telemetry. The "Performance Cliff" is the context size where TPS drops below usable thresholds:

  • TPS > 15 → ✅ Smooth
  • TPS 5–15 → ⚠️ Slowing
  • TPS < 5 → 💀 Cliff
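
The thresholds above map directly onto a small classification function; a sketch:

```python
def classify(tps):
    """Map tokens-per-second onto the status shown in the results table."""
    if tps > 15:
        return "Smooth"
    if tps >= 5:
        return "Slowing"
    return "Cliff"

for tps in (45.2, 12.1, 2.3):
    print(tps, classify(tps))  # Smooth, Slowing, Cliff
```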

FAQ

What if I don't have a GPU? stressllm works fine in CPU-only mode. GPU columns are replaced with CPU% and the verdict adapts accordingly.

What models work? Any model available in Ollama. Run ollama list to see what you have pulled.

How accurate is this? The synthetic prompts stress the KV cache but don't perfectly replicate real workloads. Use the results as a ceiling — real-world performance may vary based on prompt complexity.

Why do I get different results on back-to-back runs? That's normal. Results can vary ±20% between runs due to thermal throttling, background system load, Ollama's KV cache state, and VRAM fragmentation. If a context size flips between "Slowing" and "Cliff" across runs, that's your borderline — treat it as the edge of what your hardware can handle.
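
Given that ±20% spread, averaging a few runs gives a steadier picture. A sketch of aggregating per-context TPS across runs (the run data here is made up for illustration):

```python
from statistics import mean

# TPS at the 32k step from three hypothetical back-to-back runs
runs_32k = [12.1, 9.8, 14.3]

avg = mean(runs_32k)
spread = max(runs_32k) - min(runs_32k)
# More than a ~20% swing around the mean: treat this context as borderline.
borderline = spread / avg > 0.2

print(f"avg={avg:.1f} spread={spread:.1f} borderline={borderline}")
```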

Ollama isn't detected but it's running? Make sure it's serving on the default port: http://localhost:11434. Check with curl http://localhost:11434/api/tags.

Contributing

See CONTRIBUTING.md for the full guide. Quick version:

git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all,dev]"

# Verify
stressllm info

# Run tests and checks
pytest
ruff check src/
bandit -r src/

Issues and PRs welcome. Please keep the code simple — this is a CLI tool, not a framework.

License

MIT
