
Terminal toolkit for local Ollama model recommendation, benchmarking, and comparison.

ollama-spark

ollama-spark is a terminal-first toolkit to help you pick, download, benchmark, and compare Ollama LLM models for local hardware. The project provides:

  • hardware detection (CPU, RAM, GPU)
  • a curated model catalog with metadata and task capabilities
  • model compatibility recommendations for common use cases (chat, coding, instruct, vision, etc.)
  • an Ollama HTTP client for listing/pulling/generating
  • a lightweight benchmark runner (TTFT, TPS, latency) and aggregation
  • a CLI (ollama-spark) to run everything from your terminal

This repository is organized as a Python package and designed to be released to PyPI as ollama-spark.


Goals

  • Help users identify which Ollama models are compatible with their local hardware.
  • Provide a simple benchmark to measure real-world performance on your machine.
  • Make it easy to pull recommended models via the local Ollama daemon and compare model trade-offs.
  • Be lightweight, well-documented, and easy to extend.

Install

Recommended: create a virtual environment and install from the project root.

python -m venv .venv
source .venv/bin/activate
pip install -e .

If you want development dependencies (tests/lint):

pip install -e .[dev]

Notes:

  • The CLI assumes a running Ollama daemon for list/pull/generate operations (default address: http://127.0.0.1:11434). A quick connectivity check is sketched below.
  • On macOS with Apple Silicon, GPU detection uses MPS heuristics; for NVIDIA/AMD GPUs the tool shells out to nvidia-smi, rocm-smi, or lspci where available.
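
If those operations fail, first confirm the daemon is actually reachable. A minimal, dependency-free check (/api/tags is Ollama's standard endpoint for listing locally installed models):

import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"

try:
    # /api/tags returns {"models": [...]} describing local models
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=5) as resp:
        models = json.load(resp).get("models", [])
        print(f"Ollama is up; {len(models)} model(s) installed")
except OSError as exc:  # URLError and timeouts are OSError subclasses
    print(f"could not reach Ollama at {OLLAMA_URL}: {exc}")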

Quick start

Detect hardware:

# show a friendly hardware summary
ollama-spark detect

List models available in your local Ollama daemon:

ollama-spark list-models

Get recommendations for coding tasks:

ollama-spark recommend --task coding --top-k 5

Pull a model (streams download progress from Ollama):

ollama-spark pull "llama3.1:8b"

Run a quick benchmark (TTFT, TPS, latency):

ollama-spark benchmark "llama3.1:8b" \
  --prompt "Write a short Python function that sorts a list" \
  --runs 2 --warmup 1 --timeout 60

Compare models (feature comparison + optional runtime micro-benchmark):

ollama-spark compare llama3.1:8b qwen2.5:7b --task coding --runtime \
  --prompt "Write a function to compute fibonacci numbers efficiently"

Concepts

  • Hardware profile: collected via ollama_spark.hardware (CPU, RAM, GPUs). This is converted into a canonical HardwareProfile used by the recommender.
  • Model spec: each model in the bundled data/models.yaml contains min_ram_gb, recommended_ram_gb, min_vram_gb, parameter_billions, capabilities (task scores), and tags.
  • CompatibilityResult: the result of checking hardware against a model (Compatible / Borderline / Incompatible), with reasons and estimated memory needs; see the sketch after this list.
  • Benchmark: the runner captures TTFT (time to first token), total latency, TPS (tokens per second), and lightweight resource samples via psutil. GPU sampling is best-effort and currently limited.
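
To make the compatibility check concrete, here is a simplified sketch of the idea. The field names follow the models.yaml schema above; the class names and threshold logic are illustrative, not the package's exact implementation:

from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    COMPATIBLE = "Compatible"
    BORDERLINE = "Borderline"
    INCOMPATIBLE = "Incompatible"

@dataclass
class HardwareProfile:
    ram_gb: float
    vram_gb: float  # 0 if there is no dedicated GPU

@dataclass
class ModelSpec:
    name: str
    min_ram_gb: float
    recommended_ram_gb: float
    min_vram_gb: float
    parameter_billions: float

def check(hw: HardwareProfile, model: ModelSpec) -> Verdict:
    # hard floor: below the minimum RAM the model will not load at all
    if hw.ram_gb < model.min_ram_gb:
        return Verdict.INCOMPATIBLE
    # enough RAM to load, but below the recommended amount (or short on
    # the VRAM the spec asks for) means it will run slowly or swap
    if hw.ram_gb < model.recommended_ram_gb or hw.vram_gb < model.min_vram_gb:
        return Verdict.BORDERLINE
    return Verdict.COMPATIBLE

print(check(HardwareProfile(ram_gb=16, vram_gb=0),
            ModelSpec("llama3.1:8b", 8, 16, 6, 8)))
# -> Verdict.BORDERLINE (enough RAM to load, no dedicated GPU)

The real recommender also weighs the per-task capability scores; this only shows the memory gate.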

CLI reference

The package installs a console script ollama-spark with the following commands:

  • detect — detect and display hardware
  • list-models — list models available to local Ollama
  • recommend — recommend models for a task using your hardware
  • pull — pull a model (streams progress)
  • benchmark — run micro-benchmarks for a model
  • compare — feature & optional runtime comparison for 2–4 models

Run ollama-spark --help or ollama-spark <command> --help for details.

Example:

ollama-spark recommend --task instruct --top-k 5

How the benchmark works (brief)

  • Warmup runs (configurable) are executed first (not recorded).
  • Measured runs call Ollama's generate streaming endpoint and:
    • record wall-clock time until the first token (TTFT)
    • record total time the request takes
    • sample CPU usage and resident memory periodically using psutil
    • estimate tokens generated (tries to use server counts if provided; otherwise naive splitting)
  • After all runs the tool computes median and p95 for TTFT and TPS, median latency, error rate, and resource aggregations. The sketch below shows the core of one measured run.
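
A minimal sketch of a single measured run against Ollama's streaming /api/generate endpoint. On recent Ollama versions the final event carries a server-side eval_count; the per-event fallback below is the naive count mentioned above:

import json
import statistics
import time
import urllib.request

def run_once(model: str, prompt: str, url: str = "http://127.0.0.1:11434"):
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": True}).encode()
    req = urllib.request.Request(f"{url}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    ttft = None
    tokens = 0
    with urllib.request.urlopen(req, timeout=120) as resp:
        for line in resp:  # NDJSON: one streaming event per line
            event = json.loads(line)
            if event.get("response"):
                if ttft is None:
                    ttft = time.perf_counter() - start  # time to first token
                tokens += 1  # naive: roughly one token per event
            if event.get("done"):
                tokens = event.get("eval_count", tokens)  # prefer server count
    latency = time.perf_counter() - start
    tps = tokens / latency if latency else 0.0
    return ttft, latency, tps

# aggregate a few measured runs (median / p95, as the runner reports)
runs = [run_once("llama3.1:8b", "Say hello briefly") for _ in range(5)]
ttfts = [ttft for ttft, _, _ in runs if ttft is not None]
print("median TTFT:", statistics.median(ttfts))
print("p95 TTFT:   ", statistics.quantiles(ttfts, n=20)[-1])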

Limitations:

  • GPU utilization and VRAM peak require polling vendor tools (nvidia-smi, rocm-smi); these are not yet fully wired into the main aggregated report. A sketch of the polling approach follows this list.
  • Token counting is approximate unless the Ollama server includes token counts in streaming events.
  • Benchmarks will be affected by other local processes and background CPU/GPU load; run them on as quiet a system as possible for repeatable results.
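
For NVIDIA, one possible approach (a sketch of the intended direction, not what the tool currently ships) is to sample nvidia-smi while the benchmark runs:

import subprocess

def sample_nvidia() -> tuple[float, float] | None:
    """Return (gpu_util_percent, vram_used_mib), or None if unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            text=True, timeout=2)
    except (OSError, subprocess.SubprocessError):
        return None  # no NVIDIA GPU, driver missing, or tool hung
    # csv output looks like "35, 4096"; take the first GPU
    util, mem = out.strip().splitlines()[0].split(", ")
    return float(util), float(mem)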

Project layout

Key files and directories:

ollama-spark/
├─ ollama_spark/
│  ├─ __init__.py
│  ├─ cli.py
│  ├─ hardware.py
│  ├─ models.py
│  ├─ ollama_client.py
│  ├─ registry.py
│  ├─ recommender.py
│  ├─ benchmark.py
│  └─ data/
│     └─ models.yaml
├─ tests/
└─ pyproject.toml

Contributing

I want this to be an excellent open source tool — you can help in several ways:

  • File issues for bugs or feature requests on the repository issue tracker.
  • Improve/extend the data/models.yaml catalog — accuracy of RAM/VRAM values and task scores improves recommendations dramatically.
  • Add tests in tests/ for:
    • registry parsing and validation (a test sketch follows this list)
    • recommender ranking behavior (unit tests with several hardware profiles)
    • Ollama client error handling (mock HTTP responses)
  • Help implement GPU metrics collection for benchmark aggregation (NVIDIA + ROCm + Apple).
  • Improve the streaming parsing to match your version of Ollama (event formats vary).
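
As an example of the first item, a self-contained schema check for the bundled catalog might look like this (PyYAML assumed; the top-level layout of models.yaml is an assumption here, so adapt the iteration to the real file):

import yaml
from importlib import resources

REQUIRED = {"min_ram_gb", "recommended_ram_gb", "min_vram_gb",
            "parameter_billions", "capabilities", "tags"}

def test_models_yaml_schema():
    text = resources.files("ollama_spark").joinpath("data/models.yaml").read_text()
    catalog = yaml.safe_load(text)
    # assumption: the file maps model names to spec dicts; adjust if the
    # real layout nests entries under a top-level key
    for name, spec in catalog.items():
        missing = REQUIRED - set(spec)
        assert not missing, f"{name} is missing fields: {missing}"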

Before you create PRs:

  1. Fork the repository.
  2. Create a feature branch.
  3. Make tests for new behavior and ensure pytest passes.
  4. Open a PR with a clear description and link to any issues.

Development & CI

Recommended dev commands:

# run tests
pytest

# run linter (if configured)
ruff check .

# run CLI locally (editable install)
python -m ollama_spark.cli detect

I will add a GitHub Actions workflow to run tests and lint on PRs and pushes to main once you confirm CI preferences (Ubuntu + macOS + Python 3.10–3.12 is typical).


Roadmap / Next steps

I will implement these items next (please tell me which you want prioritized):

  1. README + LICENSE (README done; MIT LICENSE file still to be added).
  2. Add unit tests for registry parsing and recommender logic. (High priority)
  3. Add CI workflow (GitHub Actions) for linting and tests. (High priority)
  4. Implement GPU usage & VRAM sampling (NVIDIA / ROCm) in the benchmark runner. (Medium)
  5. Improve token counting (integrate tokenizers or use server-provided token counts). (Medium)
  6. Persist benchmark results to a small local DB and add history CLI. (Lower)
  7. Prepare packaging and PyPI release (bump version and add release workflow). (Lower)

Tell me which 2–3 items you want me to implement next and I will continue immediately.


Security & privacy notes

  • By default the tool talks only to a local Ollama daemon. It does not upload hardware information anywhere.
  • If you decide to add remote registries or model repositories, be careful with credentials and always use secure transfer (HTTPS). I can add a secure store for API tokens if needed.

License

This project is intended to be MIT-licensed (I'll add a LICENSE file with your confirmation). If you prefer a different license, tell me which one.


Contact / Maintainer

If you want me to continue I can:

  • add the LICENSE file,
  • implement tests + CI,
  • add GitHub Actions to run tests & lint,
  • prepare a PyPI-ready release and draft changelog.

Tell me which items to prioritize next and whether you want me to:

  • Use MIT or another license
  • Target a specific set of Python versions for CI
  • Add support for automatic model downloads (pull) after recommendations

I'll proceed once you confirm the next priorities.
