⚡ stressllm

Find the breaking point of your local LLM hardware.

stressllm is a CLI benchmarking tool that finds the "Performance Cliff" of your local setup. It progressively grows the context window and measures tokens-per-second, latency, VRAM usage, GPU temperature, and RAM pressure — then tells you exactly where your hardware gives up.

Quick Start

pip install stressllm

# Stress test a model via Ollama
stressllm run gemma2 --depth 3

# Check your hardware and dependencies
stressllm info

Prerequisites

Requirement            Required?       Notes
────────────────────   ─────────────   ───────────────────────────────────────
Python 3.9+            Yes
Ollama                 Yes (for run)   Must be running: ollama serve
NVIDIA GPU + drivers   Optional        Enables VRAM and temperature monitoring
llama-cpp-python       Optional        Only needed for the check command

stressllm checks for Ollama on startup and will tell you exactly what's missing if something isn't right.
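
If you want to reproduce that check by hand, a quick probe of the default endpoint looks like this (a sketch in Python using only the standard library, equivalent to the curl command in the FAQ below):

import json, urllib.request

try:
    # Ollama's default endpoint; /api/tags lists the locally pulled models.
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=2) as resp:
        names = [m["name"] for m in json.load(resp).get("models", [])]
    print("Ollama is up. Models:", ", ".join(names) or "(none pulled yet)")
except OSError:
    print("Ollama not reachable; start it with `ollama serve`.")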

Installation

# Basic install (Ollama stress testing)
pip install stressllm

# With GPU monitoring
pip install stressllm[gpu]

# With direct .gguf file analysis
pip install stressllm[gguf]

# Everything
pip install stressllm[all]

For development:

git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all]"

Usage

stressllm run — Stress test via Ollama

stressllm run gemma2 --depth 3

Progressively fills the context window (2k → 8k → 32k → ...) and measures performance at each step.

Option      Default   Description
─────────   ───────   ──────────────────────────────────────────────────────────
--depth     3         Context steps (1–5). Higher = larger contexts tested.
--timeout   300       Max seconds per context step. 0 = no limit.
--verbose   off       Show detected hardware and dependency info before the test.
--json      off       Output results as JSON for scripting and CI.

Example output:

╭─────────────────────────────────────────────────────╮
│  ⚡ stressllm — Stress Testing: gemma2              │
│  NVIDIA RTX 4090 · 24GB VRAM · 64GB RAM             │
╰─────────────────────────────────────────────────────╯

 Context   TPS     TTFT      VRAM     GPU Temp   RAM     Status
 ───────   ─────   ──────    ──────   ────────   ─────   ──────
 2k        45.2    120ms     34.2%    52°C       41%     ✅ Smooth
 8k        38.7    340ms     58.1%    61°C       43%     ✅ Smooth
 32k       12.1    1.4s      89.3%    74°C       52%     ⚠️  Slowing
 128k      2.3     8.2s      97.8%    82°C       68%     💀 Cliff

╭─────────────────────────────────────────────────────╮
│  Verdict: gemma2 runs well up to 8k context.        │
│  Performance cliff detected at 32k.                 │
╰─────────────────────────────────────────────────────╯

stressllm check — Direct .gguf analysis

stressllm check ./models/gemma-2b-q4.gguf --n-gpu -1

Loads a .gguf file directly into memory (no Ollama needed) and benchmarks it.

Option    Default   Description
───────   ───────   ─────────────────────────────────
--n-gpu   -1        GPU layers to offload (-1 = all).
--depth   3         Context steps (1–5).

Requires llama-cpp-python: pip install stressllm[gguf]
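
For context, this is roughly what loading a .gguf directly means in llama-cpp-python (a generic illustration, not stressllm's internals; it reuses the example path above):

from llama_cpp import Llama

# Load the file straight into memory, offloading all layers to the GPU (-1)
# and allocating an 8k context window for this step.
llm = Llama(
    model_path="./models/gemma-2b-q4.gguf",
    n_gpu_layers=-1,
    n_ctx=8192,
    verbose=False,
)

# Generate a handful of tokens to exercise the KV cache.
out = llm("Describe your hardware limits.", max_tokens=32)
print(out["choices"][0]["text"])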

stressllm info — Hardware & dependency check

stressllm info

Shows detected GPU, RAM, CPU cores, dependency status (Ollama, pynvml, llama-cpp-python), depth level reference, and status legend. Useful for debugging and issue reports.

stressllm models — List available models

stressllm models

Lists all models pulled in Ollama with their size and a ready-to-copy run command for each one.

Known Limitations

  • TPS measures generation speed. The model generates 32 tokens at each context step to measure real-world output speed. TTFT (time to first token) measures how fast the model processes your input context. A small worked example follows this list.
  • High depths are slow. Depth 4 (128k) and depth 5 (512k) can take several minutes per step. Start with --depth 1 or --depth 2 to verify things work before going deeper. Each step has a default timeout of 5 minutes — use --timeout 120 to shorten it or --timeout 0 for no limit.
  • Ctrl+C works during tests. If a step is taking too long, press Ctrl+C to stop and see partial results for steps already completed.
  • GPU metrics are NVIDIA-only. AMD and Apple Silicon GPUs won't report VRAM or temperature. The tool still works in CPU-only mode with RAM and CPU% metrics.
  • Model names must be exact. Use the full name including the tag — gemma:2b, not gemma. Run stressllm models to see exact names available on your machine.
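
Worked example of how the two metrics relate (illustrative numbers, not output from a real run; stressllm's exact timing conventions may differ):

# Timing a single context step (hypothetical numbers).
prompt_sent_at = 0.0      # request sent
first_token_at = 1.4      # first output token arrives  ->  TTFT = 1.4 s
last_token_at = 4.0       # 32nd output token arrives
tokens_generated = 32

ttft = first_token_at - prompt_sent_at                       # prompt processing time
tps = tokens_generated / (last_token_at - first_token_at)    # 32 / 2.6 ≈ 12.3 tok/s
print(f"TTFT: {ttft:.1f}s  TPS: {tps:.1f}")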

How It Works

stressllm forces the model to allocate progressively larger KV caches by setting num_ctx on each Ollama request. It generates prompts from a pool of 1000 common English words (each word ≈ 1 token) to accurately fill the context window:

Depth   Context steps tested
─────   ───────────────────────────
1       2k
2       2k → 8k
3       2k → 8k → 32k
4       2k → 8k → 32k → 128k
5       2k → 8k → 32k → 128k → 512k

At each step, it measures tokens-per-second (TPS), time-to-first-token (TTFT), and hardware telemetry. The "Performance Cliff" is the context size where TPS drops below usable thresholds:

  • TPS > 15 → ✅ Smooth
  • TPS 5–15 → ⚠️ Slowing
  • TPS < 5 → 💀 Cliff
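
A minimal sketch of that mechanism, assuming Ollama's standard /api/generate streaming API (this is not stressllm's actual code; the single filler word, 90% fill factor, and one-token-per-chunk counting are simplifications of the word-pool approach described above):

import json, time, urllib.request

def stress_step(model: str, num_ctx: int) -> dict:
    # Fill most of the window with filler words (~1 token each), leaving
    # headroom for the 32 generated tokens.
    prompt = " ".join(["context"] * int(num_ctx * 0.9))
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": True,
        "options": {"num_ctx": num_ctx, "num_predict": 32},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    first, tokens = None, 0
    with urllib.request.urlopen(req) as resp:
        for line in resp:                     # Ollama streams one JSON object per line
            chunk = json.loads(line)
            if chunk.get("response"):         # roughly one token per streamed chunk
                first = first or time.time()
                tokens += 1
            if chunk.get("done"):
                break
    end = time.time()
    ttft = (first or end) - start
    tps = tokens / max(end - first, 1e-9) if first else 0.0
    status = "Smooth" if tps > 15 else "Slowing" if tps >= 5 else "Cliff"
    return {"num_ctx": num_ctx, "ttft_s": round(ttft, 2),
            "tps": round(tps, 1), "status": status}

for ctx in (2048, 8192, 32768):               # depth 3: 2k -> 8k -> 32k
    print(stress_step("gemma2", ctx))

Hardware telemetry (VRAM, GPU temperature, RAM) is collected separately in the real tool; this sketch only covers the timing side.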

FAQ

What if I don't have a GPU? stressllm works fine in CPU-only mode. GPU columns are replaced with CPU% and the verdict adapts accordingly.

What models work? Any model available in Ollama. Run ollama list to see what you have pulled.

How accurate is this? The synthetic prompts stress the KV cache but don't perfectly replicate real workloads. Use the results as a ceiling — real-world performance may vary based on prompt complexity.

I get different results on back-to-back runs? Normal. Results can vary ±20% between runs due to thermal throttling, background system load, Ollama's KV cache state, and VRAM fragmentation. If a context size flips between "Slowing" and "Cliff" across runs, that's your borderline — treat it as the edge of what your hardware can handle.

Ollama isn't detected but it's running? Make sure it's serving on the default port: http://localhost:11434. Check with curl http://localhost:11434/api/tags.

Contributing

See CONTRIBUTING.md for the full guide. Quick version:

git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all,dev]"

# Verify
stressllm info

# Run tests and checks
pytest
ruff check src/
bandit -r src/

Issues and PRs welcome. Please keep the code simple — this is a CLI tool, not a framework.

License

MIT
