ollama-spark
Terminal toolkit for local Ollama model recommendation, benchmarking, and comparison.
ollama-spark is a terminal-first toolkit to help you pick, download, benchmark, and compare Ollama LLM models for local hardware. The project provides:
- hardware detection (CPU, RAM, GPU)
- a curated model catalog with metadata and task capabilities
- model compatibility recommendations for common use-cases (chat, coding, instruct, vision, etc.)
- an Ollama HTTP client for listing/pulling/generating
- a lightweight benchmark runner (TTFT, TPS, latency) and aggregation
- a CLI (`ollama-spark`) to run everything from your terminal
This repository is organized as a Python package and designed to be released to PyPI as ollama-spark.
Goals
- Help users identify which Ollama models are compatible with their local hardware.
- Provide a simple benchmark to measure real-world performance on your machine.
- Make it easy to pull recommended models via the local Ollama daemon and compare model trade-offs.
- Be lightweight, well-documented, and easy to extend.
Table of Contents
- Install
- Quick start
- Concepts
- CLI reference
- How the benchmark works
- Project layout
- Contributing
- Roadmap / Next steps
- License
Install
Recommended: create a virtual environment and install from the project root.
```shell
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
If you want development dependencies (tests/lint):
```shell
pip install -e ".[dev]"
```
Notes:
- The CLI assumes a running Ollama daemon for list/pull/generate operations (default address: `http://127.0.0.1:11434`).
- On macOS with Apple Silicon you'll get MPS detection heuristics; for NVIDIA/AMD GPUs the tool uses `nvidia-smi`, `rocm-smi`, or `lspci` where available.
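The detection approach can be sketched with the standard library alone; `detect_gpu_tool` and `cpu_summary` below are illustrative helpers, not the package's actual API:

```python
import os
import platform
import shutil

def detect_gpu_tool():
    """Best-effort guess at which GPU query tool is available (illustrative)."""
    for tool in ("nvidia-smi", "rocm-smi", "lspci"):
        if shutil.which(tool):
            return tool
    # Apple Silicon exposes the GPU via Metal/MPS rather than a CLI tool
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps-heuristic"
    return None

def cpu_summary():
    """Minimal CPU summary using only the standard library."""
    return {"cores": os.cpu_count(), "arch": platform.machine()}

print(cpu_summary())
print(detect_gpu_tool())
```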
Quick start
Detect hardware:
```shell
# show a friendly hardware summary
ollama-spark detect
```
List models available in your local Ollama daemon:
```shell
ollama-spark list-models
```
Get recommendations for coding tasks:
```shell
ollama-spark recommend --task coding --top-k 5
```
Pull a model (streams download progress from Ollama):
```shell
ollama-spark pull "llama3.1:8b"
```
Run a quick benchmark (TTFT, TPS, latency):
```shell
ollama-spark benchmark "llama3.1:8b" \
  --prompt "Write a short Python function that sorts a list" \
  --runs 2 --warmup 1 --timeout 60
```
Compare models (feature comparison + optional runtime micro-benchmark):
```shell
ollama-spark compare llama3.1:8b qwen2.5:7b --task coding --runtime \
  --prompt "Write a function to compute fibonacci numbers efficiently"
```
Concepts
- Hardware profile: collected via `ollama_spark.hardware` (CPU, RAM, GPUs). This is converted into a canonical `HardwareProfile` used by the recommender.
- Model spec: each model in the bundled `data/models.yaml` contains `min_ram_gb`, `recommended_ram_gb`, `min_vram_gb`, `parameter_billions`, `capabilities` (task scores), and `tags`.
- CompatibilityResult: the result of hardware-vs-model checks (Compatible / Borderline / Incompatible) with reasons and estimated memory needs.
- Benchmark: the runner captures TTFT (time to first token), total latency, TPS (tokens per second), and lightweight resource samples via `psutil`. GPU sampling is best-effort and currently limited.
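A minimal sketch of the compatibility check described above, assuming the field names from the concepts list (the classes here are illustrative, not the package's real types):

```python
from dataclasses import dataclass

@dataclass
class HardwareProfile:
    ram_gb: float
    vram_gb: float  # 0 if no discrete GPU

@dataclass
class ModelSpec:
    name: str
    min_ram_gb: float
    recommended_ram_gb: float
    min_vram_gb: float

def check_compatibility(hw: HardwareProfile, model: ModelSpec) -> str:
    """Classify a model as Compatible / Borderline / Incompatible for this hardware."""
    if hw.ram_gb < model.min_ram_gb or hw.vram_gb < model.min_vram_gb:
        return "Incompatible"
    if hw.ram_gb < model.recommended_ram_gb:
        return "Borderline"
    return "Compatible"

hw = HardwareProfile(ram_gb=16, vram_gb=0)
spec = ModelSpec("llama3.1:8b", min_ram_gb=8, recommended_ram_gb=16, min_vram_gb=0)
print(check_compatibility(hw, spec))  # Compatible
```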
CLI reference
The package installs a console script `ollama-spark` with the following commands:
- `detect` — detect and display hardware
- `list-models` — list models available to local Ollama
- `recommend` — recommend models for a task using your hardware
- `pull` — pull a model (streams progress)
- `benchmark` — run micro-benchmarks for a model
- `compare` — feature & optional runtime comparison for 2–4 models

Run `ollama-spark --help` or `ollama-spark <command> --help` for details.
Example:
```shell
ollama-spark recommend --task instruct --top-k 5
```
How the benchmark works (brief)
- Warmup runs (configurable) are executed first (not recorded).
- Measured runs call Ollama's `generate` streaming endpoint and:
  - record wall-clock time until the first token (TTFT)
  - record the total time the request takes
  - sample CPU usage and resident memory periodically using `psutil`
  - estimate tokens generated (server-provided counts are used if present; otherwise naive splitting)
- After all runs the tool computes median and p95 for TTFT and TPS, median latency, error rate, and resource aggregations.
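The aggregation step is ordinary statistics; a sketch of the median/p95 computation over per-run measurements (function names are illustrative):

```python
import statistics

def p95(values):
    """95th percentile via nearest-rank on the sorted sample."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return ordered[idx]

def aggregate(ttft_s, tps):
    """Summarize per-run TTFT (seconds) and TPS samples."""
    return {
        "ttft_median_s": statistics.median(ttft_s),
        "ttft_p95_s": p95(ttft_s),
        "tps_median": statistics.median(tps),
        "tps_p95": p95(tps),
    }

runs_ttft = [0.41, 0.38, 0.55, 0.40]
runs_tps = [31.2, 29.8, 27.5, 30.4]
print(aggregate(runs_ttft, runs_tps))
```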
Limitations:
- GPU utilization and VRAM peak require polling vendor tools (`nvidia-smi`, `rocm-smi`); these are not yet fully implemented in the main aggregated report.
- Token counting is approximate unless the Ollama server includes token counts in streaming events.
- Benchmarks will be affected by other local processes and background CPU/GPU load; run them on as quiet a system as possible for repeatable results.
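For reference, Ollama's streaming responses are newline-delimited JSON, and the final event (`"done": true`) typically includes an `eval_count` field with the exact generated-token count. A sketch of the prefer-server-counts fallback (the helper name is illustrative):

```python
import json

def tokens_from_stream(ndjson_lines):
    """Prefer the server's eval_count; fall back to naive whitespace splitting."""
    text_parts = []
    for line in ndjson_lines:
        event = json.loads(line)
        text_parts.append(event.get("response", ""))
        if event.get("done") and "eval_count" in event:
            return event["eval_count"], "server"
    # Naive fallback: whitespace tokens are only a rough approximation
    return len("".join(text_parts).split()), "estimated"

stream = [
    '{"response": "Hello ", "done": false}',
    '{"response": "world", "done": false}',
    '{"response": "", "done": true, "eval_count": 2, "eval_duration": 64000000}',
]
print(tokens_from_stream(stream))  # (2, 'server')
```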
Project layout
Key files and directories:
```
ollama-spark/
├─ ollama_spark/
│  ├─ __init__.py
│  ├─ cli.py
│  ├─ hardware.py
│  ├─ models.py
│  ├─ ollama_client.py
│  ├─ registry.py
│  ├─ recommender.py
│  ├─ benchmark.py
│  └─ data/
│     └─ models.yaml
├─ tests/
└─ pyproject.toml
```
Contributing
I want this to be an excellent open source tool — you can help in several ways:
- File issues for bugs or feature requests on the repository issue tracker.
- Improve/extend the `data/models.yaml` catalog — accurate RAM/VRAM values and task scores improve recommendations dramatically.
- Add tests in `tests/` for:
- registry parsing and validation
- recommender ranking behavior (unit tests with several hardware profiles)
- Ollama client error handling (mock HTTP responses)
- Help implement GPU metrics collection for benchmark aggregation (NVIDIA + ROCm + Apple).
- Improve the streaming parsing to match your version of Ollama (event formats vary).
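The Ollama client error-handling tests can run without a live daemon by mocking the HTTP layer; `list_models` below is a hypothetical stand-in for the real client, shown only to illustrate the pattern:

```python
from unittest import mock

def list_models(session, base_url="http://127.0.0.1:11434"):
    """Hypothetical client call: return model names, or [] if the daemon is unreachable."""
    try:
        resp = session.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
        return [m["name"] for m in resp.json().get("models", [])]
    except Exception:
        return []

# Simulate a healthy daemon
ok = mock.Mock()
ok.get.return_value.json.return_value = {"models": [{"name": "llama3.1:8b"}]}
ok.get.return_value.raise_for_status.return_value = None
print(list_models(ok))  # ['llama3.1:8b']

# Simulate a connection failure
down = mock.Mock()
down.get.side_effect = ConnectionError("daemon not running")
print(list_models(down))  # []
```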
Before you create PRs:
- Fork the repository.
- Create a feature branch.
- Add tests for new behavior and ensure `pytest` passes.
- Open a PR with a clear description and link to any issues.
Development & CI
Recommended dev commands:
```shell
# run tests
pytest

# run linter (if configured)
ruff .

# run CLI locally (editable install)
python -m ollama_spark.cli detect
```
I will add a GitHub Actions workflow to run tests and lint on PRs and pushes to main once you confirm CI preferences (Ubuntu + macOS + Python 3.10–3.12 is typical).
Roadmap / Next steps
I will implement these items next (please tell me which you want prioritized):
- README + LICENSE (this README is done; the MIT license file is still to add).
- Add unit tests for registry parsing and recommender logic. (High priority)
- Add CI workflow (GitHub Actions) for linting and tests. (High priority)
- Implement GPU usage & VRAM sampling (NVIDIA / ROCm) in the benchmark runner. (Medium)
- Improve token counting (integrate tokenizers or use server-provided token counts). (Medium)
- Persist benchmark results to a small local DB and add a `history` CLI command. (Lower)
- Prepare packaging and PyPI release (bump version and add release workflow). (Lower)
Tell me which 2–3 items you want me to implement next and I will continue immediately.
Security & privacy notes
- The tool talks only to the local Ollama daemon by default. It does not upload hardware information anywhere.
- If you decide to add remote registries or model repositories, be careful with credentials and always use secure transfer (HTTPS). I can add a secure store for API tokens if needed.
License
This project is intended to be MIT-licensed (I'll add a LICENSE file with your confirmation). If you prefer a different license, tell me which one.
Contact / Maintainer
If you want me to continue I can:
- add the `LICENSE` file,
- implement tests + CI,
- add GitHub Actions to run tests & lint,
- prepare a PyPI-ready release and draft changelog.
Tell me which items to prioritize next and whether you want me to:
- Use MIT or another license
- Target a specific set of Python versions for CI
- Add support for automatic model downloads (pull) after recommendations
I'll proceed once you confirm the next priorities.
Project details
File details
Details for the file ollama_spark-0.1.1.tar.gz.
File metadata
- Download URL: ollama_spark-0.1.1.tar.gz
- Upload date:
- Size: 52.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `2c32f0dd1668913318b962a948422f9032ac703a777f6f9d58af0e2b8178db6e` |
| MD5 | `d6b63fd51230f91ae4302f552c4160a2` |
| BLAKE2b-256 | `aff924ecac387d1b20cea85755c0775399c0882dc4e01c044791714fdc77e4f0` |
File details
Details for the file ollama_spark-0.1.1-py3-none-any.whl.
File metadata
- Download URL: ollama_spark-0.1.1-py3-none-any.whl
- Upload date:
- Size: 49.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b7ef02c1db58e606967004048e8442324c01f4fccf75600c3470333211b54c82` |
| MD5 | `d1f95fa2b96aaabafd95574c85a5ccce` |
| BLAKE2b-256 | `a6b11bbbe989eb0188b5b84443aceb0705b7fb0157b6876bfd7ffd322c3691a6` |