ollama-spark

Terminal toolkit for local Ollama model recommendation, benchmarking, and comparison.

ollama-spark is a terminal-first toolkit to help you pick, download, benchmark, and compare Ollama LLM models for local hardware. The project provides:

  • hardware detection (CPU, RAM, GPU)
  • a curated model catalog with metadata and task capabilities
  • model compatibility recommendations for common use-cases (chat, coding, instruct, vision, etc.)
  • an Ollama HTTP client for listing/pulling/generating
  • a lightweight benchmark runner (TTFT, TPS, latency) and aggregation
  • a CLI (ollama-spark) to run everything from your terminal

This repository is organized as a Python package and designed to be released to PyPI as ollama-spark.


Goals

  • Help users identify which Ollama models are compatible with their local hardware.
  • Provide a simple benchmark to measure real-world performance on your machine.
  • Make it easy to pull recommended models via the local Ollama daemon and compare model trade-offs.
  • Be lightweight, well-documented, and easy to extend.

Install

Recommended: create a virtual environment and install from the project root.

python -m venv .venv
source .venv/bin/activate
pip install -e .

If you want development dependencies (tests/lint):

pip install -e .[dev]

Notes:

  • The CLI assumes a running Ollama daemon for list/pull/generate operations (default address: http://127.0.0.1:11434).
  • On macOS with Apple Silicon, MPS availability is detected heuristically; for NVIDIA/AMD GPUs the tool uses nvidia-smi / rocm-smi / lspci where available.

Quick start

Detect hardware:

# show a friendly hardware summary
ollama-spark detect

List models available in your local Ollama daemon:

ollama-spark list-models

Get recommendations for coding tasks:

ollama-spark recommend --task coding --top-k 5

Pull a model (streams download progress from Ollama):

ollama-spark pull "llama3.1:8b"

Run a quick benchmark (TTFT, TPS, latency):

ollama-spark benchmark "llama3.1:8b" \
  --prompt "Write a short Python function that sorts a list" \
  --runs 2 --warmup 1 --timeout 60

Compare models (feature comparison + optional runtime micro-benchmark):

ollama-spark compare llama3.1:8b qwen2.5:7b --task coding --runtime \
  --prompt "Write a function to compute fibonacci numbers efficiently"
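All of the commands above wrap Ollama's HTTP API. For reference, a minimal direct call to the generate endpoint (non-streaming) might look like the sketch below; it assumes a daemon at the default address and uses the standard `/api/generate` request shape, which may vary across Ollama versions.

```python
import json
import urllib.request


def generate(model: str, prompt: str,
             base_url: str = "http://127.0.0.1:11434") -> str:
    """One-shot (non-streaming) call to Ollama's /api/generate endpoint.
    Requires a running Ollama daemon; illustrative only."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]


# Only works with a running daemon and the model pulled:
# print(generate("llama3.1:8b", "Say hi in one word"))
```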

Concepts

  • Hardware profile: collected via ollama_spark.hardware (CPU, RAM, GPUs). This is converted into a canonical HardwareProfile used by the recommender.
  • Model spec: each model in the bundled data/models.yaml contains min_ram_gb, recommended_ram_gb, min_vram_gb, parameter_billions, capabilities (task scores), and tags.
  • CompatibilityResult: result of hardware vs model checks (Compatible / Borderline / Incompatible) with reasons and estimated memory needs.
  • Benchmark: the runner captures TTFT (time to first token), total latency, TPS (tokens per second), and lightweight resource samples via psutil. GPU sampling is best-effort and currently limited.
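To make the model-spec and compatibility concepts concrete, here is a sketch of how the pieces fit together. Field names follow the models.yaml schema described above; the class names, thresholds, and three-way rule are illustrative assumptions, and the logic in ollama_spark.recommender may differ.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    # Field names follow the models.yaml schema described above.
    name: str
    parameter_billions: float
    min_ram_gb: float
    recommended_ram_gb: float


def check_compatibility(spec: ModelSpec, ram_gb: float) -> str:
    """Classify hardware vs. model RAM needs into the three verdicts
    (Compatible / Borderline / Incompatible). Illustrative rule only."""
    if ram_gb >= spec.recommended_ram_gb:
        return "Compatible"
    if ram_gb >= spec.min_ram_gb:
        return "Borderline"
    return "Incompatible"


llama = ModelSpec("llama3.1:8b", 8.0, min_ram_gb=8.0, recommended_ram_gb=16.0)
print(check_compatibility(llama, ram_gb=32.0))  # Compatible
```

The real CompatibilityResult also carries reasons and estimated memory needs, and factors in VRAM for GPU offload.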

CLI reference

The package installs a console script ollama-spark with the following commands:

  • detect — detect and display hardware
  • list-models — list models available to local Ollama
  • recommend — recommend models for a task using your hardware
  • pull — pull a model (streams progress)
  • benchmark — run micro-benchmarks for a model
  • compare — feature & optional runtime comparison for 2–4 models

Run ollama-spark --help or ollama-spark <command> --help for details.

Example:

ollama-spark recommend --task instruct --top-k 5

How the benchmark works (brief)

  • Warmup runs (configurable) are executed first (not recorded).
  • Measured runs call Ollama's generate streaming endpoint and:
    • record wall-clock time until the first token (TTFT)
    • record total time the request takes
    • sample CPU usage and resident memory periodically using psutil
    • estimate tokens generated (tries to use server counts if provided; otherwise naive splitting)
  • After all runs the tool computes median and p95 for TTFT and TPS, median latency, error rate, and resource aggregations.
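The aggregation step can be sketched with the standard library. The p95 here is a simple inclusive quantile; the runner's exact method may differ, and the metric names are illustrative.

```python
import statistics


def aggregate(ttfts: list[float], tps: list[float],
              latencies: list[float]) -> dict:
    """Compute the summary statistics described above: median and p95
    for TTFT and TPS, plus median latency."""
    def p95(xs: list[float]) -> float:
        # 19 cut points at 5% steps; the last one is the 95th percentile.
        return statistics.quantiles(xs, n=20, method="inclusive")[-1]

    return {
        "ttft_median": statistics.median(ttfts),
        "ttft_p95": p95(ttfts),
        "tps_median": statistics.median(tps),
        "tps_p95": p95(tps),
        "latency_median": statistics.median(latencies),
    }


print(aggregate([0.2, 0.3, 0.25], [40.0, 42.0, 38.0], [3.1, 2.9, 3.0]))
```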

Limitations:

  • GPU utilization and VRAM peak require polling vendor tools (nvidia-smi, rocm-smi) — these are not yet fully implemented in the main aggregated report.
  • Token counting is approximate unless the Ollama server includes token counts in streaming events.
  • Benchmarks will be affected by other local processes and background CPU/GPU load; run them on as quiet a system as possible for repeatable results.
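The token-counting fallback mentioned above can be sketched as follows. Ollama's streaming generate endpoint emits newline-delimited JSON events; the final event (with "done": true) usually carries an eval_count field with the server-side token count. Field names vary across Ollama versions, so treat this as a sketch of the fallback logic rather than the runner's actual implementation.

```python
import json


def count_tokens(ndjson_lines: list[str]) -> int:
    """Prefer the server-reported token count (eval_count on the final
    event); fall back to naive whitespace splitting of the streamed text."""
    text_parts = []
    for line in ndjson_lines:
        event = json.loads(line)
        text_parts.append(event.get("response", ""))
        if event.get("done") and "eval_count" in event:
            return event["eval_count"]  # authoritative server count
    return len("".join(text_parts).split())  # rough approximation


events = [
    '{"response": "Hello ", "done": false}',
    '{"response": "world", "done": false}',
    '{"response": "", "done": true, "eval_count": 2}',
]
print(count_tokens(events))  # 2
```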

Project layout

Key files and directories:

ollama-spark/
├─ ollama_spark/
│  ├─ __init__.py
│  ├─ cli.py
│  ├─ hardware.py
│  ├─ models.py
│  ├─ ollama_client.py
│  ├─ registry.py
│  ├─ recommender.py
│  ├─ benchmark.py
│  └─ data/
│     └─ models.yaml
├─ tests/
└─ pyproject.toml

Contributing

Contributions are welcome; you can help in several ways:

  • File issues for bugs or feature requests on the repository issue tracker.
  • Improve/extend the data/models.yaml catalog — accuracy of RAM/VRAM values and task scores improves recommendations dramatically.
  • Add tests in tests/ for:
    • registry parsing and validation
    • recommender ranking behavior (unit tests with several hardware profiles)
    • Ollama client error handling (mock HTTP responses)
  • Help implement GPU metrics collection for benchmark aggregation (NVIDIA + ROCm + Apple).
  • Improve the streaming parsing to match your version of Ollama (event formats vary).

Before you create PRs:

  1. Fork the repository.
  2. Create a feature branch.
  3. Make tests for new behavior and ensure pytest passes.
  4. Open a PR with a clear description and link to any issues.

Development & CI

Recommended dev commands:

# run tests
pytest

# run linter (if configured)
ruff check .

# run CLI locally (editable install)
python -m ollama_spark.cli detect

A GitHub Actions workflow to run tests and lint on PRs and pushes to main is planned (Ubuntu + macOS across Python 3.10–3.12 is a typical matrix).


Roadmap / Next steps

Planned work, roughly in priority order:

  1. README + LICENSE: README done, MIT LICENSE pending.
  2. Add unit tests for registry parsing and recommender logic. (High priority)
  3. Add CI workflow (GitHub Actions) for linting and tests. (High priority)
  4. Implement GPU usage & VRAM sampling (NVIDIA / ROCm) in the benchmark runner. (Medium)
  5. Improve token counting (integrate tokenizers or use server-provided token counts). (Medium)
  6. Persist benchmark results to a small local DB and add history CLI. (Lower)
  7. Prepare packaging and PyPI release (bump version and add release workflow). (Lower)

Feedback on prioritization is welcome via the repository issue tracker.


Security & privacy notes

  • By default the tool talks only to a local Ollama daemon; it does not upload hardware information anywhere.
  • If you add remote registries or model repositories, handle credentials carefully and always use secure transport (HTTPS); a secure store for API tokens may be added if needed.

License

This project is intended to be MIT-licensed; a LICENSE file will be added to the repository.


Contact / Maintainer

Questions, bug reports, and feature requests are welcome on the repository issue tracker. Near-term maintenance work includes adding the LICENSE file, implementing tests and CI (GitHub Actions for tests and lint), and preparing a PyPI-ready release with a draft changelog. Open questions tracked for contributors:

  • Which license to use (MIT is the default choice)
  • Which Python versions to target in CI
  • Whether to pull recommended models automatically after a recommendation
