
Smart multimodal router — LLM inference, image generation, speech-to-text, and embeddings across your device fleet. Cross-platform: macOS, Linux, Windows.


Ollama Herd


Turn all your devices into one local AI cluster. Ollama Herd is a smart inference router and load balancer that auto-discovers Ollama nodes via mDNS and routes LLM, image-generation, speech-to-text, and embedding requests to the optimal device using intelligent scoring. OpenAI-compatible API. Zero config. Zero cost.

Why Ollama Herd?

  • Your spare Mac is wasting compute — pool all your devices into one fleet
  • Single Ollama bottlenecks agents — distribute requests across machines automatically
  • Cloud APIs cost $450-1,800/month at fleet scale — local inference is zero marginal cost
  • No config files, no Docker, no Kubernetes — two commands, mDNS auto-discovery
  • Not just LLMs — routes image generation (FLUX), speech-to-text (Qwen3-ASR), and embeddings too
  • The fleet gets smarter over time — capacity learning, thermal awareness, meeting detection

Quick Start

pip install ollama-herd

Or with Homebrew (macOS/Linux):

brew tap geeks-accelerator/ollama-herd
brew install ollama-herd

On your router machine:

herd

On each device running Ollama:

herd-node

That's it. The node discovers the router via mDNS and starts sending heartbeats. No config files needed.

To skip mDNS and connect directly: herd-node --router-url http://router-ip:11435

Features

  • Smart Scoring: routes to the best device based on thermal state, memory fit, queue depth, latency, affinity, availability, and context fit
  • Zero-Config Discovery: mDNS auto-discovery; no IPs, no config files, no manual setup
  • Multimodal Routing: LLMs, vision (gemma3, llava, llama3.2-vision), embeddings, image generation (FLUX via mflux/DiffusionKit), speech-to-text (Qwen3-ASR)
  • Live Dashboard: fleet overview, trends, model insights, per-app analytics, benchmarks, health, recommendations, settings
  • Capacity Learning: 168-slot weekly behavioral model per device; learns when your machines are available
  • Auto-Retry & Fallbacks: transparent retry on failure plus client-specified backup models
  • Thinking Model Support: auto-detects DeepSeek-R1, QwQ, and phi-4-reasoning, and inflates token budgets to prevent empty responses
  • Smart Benchmarks: auto-discovers the fleet, benchmarks all five model types, and tracks performance over time
  • Dynamic Context: measures actual token usage and auto-adjusts context windows to free KV cache memory
  • Fleet Intelligence: AI-generated fleet briefings with health summaries, trend analysis, and actionable recommendations
  • Health Engine: 18 automated checks covering memory, thermal state, context waste, thrashing, timeouts, errors, zombies, priority models, and more
  • Request Tagging: per-app analytics via tags; track usage, latency, and errors per application or team

Usage

Point any OpenAI-compatible client at the router:

from openai import OpenAI

client = OpenAI(base_url="http://router-ip:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Or use the Ollama API directly:

curl http://router-ip:11435/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

Model Fallbacks

curl http://router-ip:11435/v1/chat/completions -d '{
  "model": "llama3.3:70b",
  "fallback_models": ["qwen2.5:32b", "qwen2.5:7b"],
  "messages": [{"role": "user", "content": "Hello!"}]
}'

The router tries each model in order, falling back seamlessly if one is unavailable.
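The fallback field rides in the normal JSON request body, so it can be sent from any HTTP client; from the OpenAI Python SDK it would go through the SDK's `extra_body` parameter. A minimal sketch of building the same request body as the curl example above:

```python
import json

# Same body as the curl example; "fallback_models" is a router extension,
# not part of the standard OpenAI chat-completions schema.
body = {
    "model": "llama3.3:70b",
    "fallback_models": ["qwen2.5:32b", "qwen2.5:7b"],
    "messages": [{"role": "user", "content": "Hello!"}],
}
payload = json.dumps(body).encode("utf-8")
# POST payload to http://router-ip:11435/v1/chat/completions with
# Content-Type: application/json (e.g. via urllib.request or requests).
```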

Beyond LLMs

The same router handles five model types — install a backend on any node and it's automatically detected.

Vision (Image Understanding)

from openai import OpenAI
client = OpenAI(base_url="http://router-ip:11435/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gemma3:27b",  # or llama3.2-vision, llava, moondream
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]
    }]
)

Works with any Ollama vision model. Both OpenAI and Ollama message formats are supported; the router converts between them automatically.
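The base64 placeholder in the snippet above comes from encoding a local image file. A minimal helper (the filename is illustrative):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a data: URL for an image_url content part."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# e.g. {"type": "image_url", "image_url": {"url": to_data_url("photo.png")}}
```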

Image Generation

# Install a backend (any node)
uv tool install mflux

# Generate
curl -o sunset.png http://router-ip:11435/api/generate-image \
  -d '{"model": "z-image-turbo", "prompt": "a sunset over mountains", "width": 1024, "height": 1024}'

Supports mflux (FLUX), DiffusionKit (Stable Diffusion 3/3.5), and Ollama native models. See Image Generation Guide.

Speech-to-Text

# Install backend (any node)
pip install 'mlx-qwen3-asr[serve]'

# Transcribe
curl http://router-ip:11435/api/transcribe -F "file=@meeting.wav" -F "model=qwen3-asr"

Embeddings

curl http://router-ip:11435/api/embed \
  -d '{"model": "nomic-embed-text", "input": ["first document", "second document"]}'

Works with any Ollama embedding model: nomic-embed-text, mxbai-embed-large, all-minilm, snowflake-arctic-embed.
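The returned vectors can be compared directly. A minimal cosine-similarity sketch, assuming the Ollama-style response shape with an "embeddings" list of vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The vectors come from the router's response, e.g. (using requests):
#   resp = requests.post("http://router-ip:11435/api/embed", json={
#       "model": "nomic-embed-text",
#       "input": ["first document", "second document"]}).json()
#   score = cosine_similarity(*resp["embeddings"])
```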

Works With

Ollama Herd is a drop-in replacement — just change the base URL:

  • Open WebUI: set the Ollama URL to http://router-ip:11435 in admin settings
  • LangChain: ChatOpenAI(base_url="http://router-ip:11435/v1")
  • CrewAI: LLM(base_url="http://router-ip:11435")
  • Aider: --openai-api-base http://router-ip:11435/v1
  • Continue.dev: set apiBase in config.json
  • OpenHands: LLM_BASE_URL=http://router-ip:11435/v1
  • OpenClaw: see the OpenClaw Integration Guide
  • Any OpenAI client: change base_url to http://router-ip:11435/v1

Platform Support

Ollama Herd runs on macOS, Linux, and Windows — anywhere Ollama runs.

  • LLM routing, scoring, queues: macOS, Linux, Windows
  • Embeddings proxy: macOS, Linux, Windows
  • mDNS auto-discovery: macOS, Linux, Windows
  • Dashboard & traces: macOS, Linux, Windows
  • Image generation (mflux, DiffusionKit): macOS (Apple Silicon) only
  • Image generation (Ollama native): macOS, Linux, Windows
  • Speech-to-text (MLX): macOS (Apple Silicon) only
  • Meeting detection (camera/mic): macOS only
  • Memory pressure detection: macOS, Linux

Core routing works identically on all platforms. macOS-only features degrade gracefully.

Architecture

┌─────────────────────────────────────────────────────┐
│  Client (OpenAI SDK, curl, any HTTP client)         │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│  Herd Router (:11435)                               │
│  ┌────────────┐ ┌──────────┐ ┌───────────────────┐  │
│  │  Scoring   │ │  Queue   │ │  Streaming Proxy  │  │
│  │  Engine    │ │  Manager │ │  (format convert) │  │
│  └────────────┘ └──────────┘ └───────────────────┘  │
│  ┌────────────┐ ┌──────────┐ ┌───────────────────┐  │
│  │  Trace     │ │  Health  │ │  Dashboard +      │  │
│  │  Store     │ │  Engine  │ │  SSE + Charts     │  │
│  └────────────┘ └──────────┘ └───────────────────┘  │
└──────────┬──────────────────────────┬───────────────┘
           │ heartbeats               │ inference
           ▼                          ▼
┌──────────────────┐       ┌──────────────────┐
│  Herd Node A     │       │  Herd Node B     │
│  (agent + Ollama)│       │  (agent + Ollama)│
│  ┌────────────┐  │       │  ┌────────────┐  │
│  │  Capacity  │  │       │  │  LAN Proxy │  │
│  │  Learner   │  │       │  │  (auto TCP)│  │
│  └────────────┘  │       │  └────────────┘  │
└──────────────────┘       └──────────────────┘

Two CLI entry points, one Python package:

  • herd — FastAPI server with scoring, queues, streaming proxy, trace store, health engine, and dashboard
  • herd-node — lightweight agent that collects system metrics, sends heartbeats, and optionally learns capacity patterns

Documentation

Document Description
API Reference All endpoints with request/response schemas
Configuration Reference All 47+ environment variables with tuning guidance
Operations Guide Logging, traces, fallbacks, retry, drain, streaming, context protection
Routing Engine Scoring pipeline deep dive
Adaptive Capacity Capacity learner, meeting detection, app fingerprinting
Request Tagging Per-app analytics and tagging strategies
Thinking Models Chain-of-thought models, budget inflation, diagnostic headers
Image Generation mflux, DiffusionKit, Ollama native setup
Troubleshooting Common issues, LAN debugging, operational gotchas
Changelog What's new in each release

Optimize Ollama for Your Hardware

Ollama's defaults are conservative. On machines with lots of memory, set these to actually use the hardware you paid for:

  • OLLAMA_KEEP_ALIVE: default 5m, recommended -1 (forever). Don't unload models from memory when you have RAM to spare.
  • OLLAMA_MAX_LOADED_MODELS: default auto, recommended -1 (unlimited). Let multiple models stay hot simultaneously.
  • OLLAMA_NUM_PARALLEL: default auto, recommended 2-4. Prevents KV cache bloat on high-memory machines.

Set via launchctl setenv (macOS), systemctl edit ollama (Linux), or system environment variables (Windows). See Configuration Reference for details.
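As a sketch of the two most common setups (the exact values are the recommendations above; on macOS, launchctl setenv lasts until reboot, so persisting it requires a LaunchAgent):

```shell
# macOS: set for the current session, then restart the Ollama app
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
launchctl setenv OLLAMA_MAX_LOADED_MODELS "-1"
launchctl setenv OLLAMA_NUM_PARALLEL "4"

# Linux (systemd service): add an override, then restart
#   sudo systemctl edit ollama     # opens an editor; add:
#     [Service]
#     Environment="OLLAMA_KEEP_ALIVE=-1"
#     Environment="OLLAMA_MAX_LOADED_MODELS=-1"
#     Environment="OLLAMA_NUM_PARALLEL=4"
#   sudo systemctl restart ollama
```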

Development

git clone https://github.com/geeks-accelerator/ollama-herd.git
cd ollama-herd
uv sync                              # install deps
uv run herd                          # start router
uv run herd-node                     # start node agent

uv sync --extra dev                  # install test deps
uv run pytest                        # run all tests (~5s)
uv run ruff check src/               # lint
uv run ruff format src/              # format

Contributing

Whether you're carbon-based or silicon-based, contributions are welcome. This project is built by humans and AI agents working together.

For humans: Fork it, run the tests (uv run pytest), make your change, open a PR. Start with CONTRIBUTING.md for guidelines and Architecture Decisions for context.

For AI agents: Read CLAUDE.md first — it's your onboarding doc. The project uses docs/issues.md for bug tracking and docs/observations.md for operational learnings.


Questions? Open a Discussion.

If Ollama Herd is useful to you, star the repo — it helps others discover the project and keeps the herd growing.

Requirements

  • Python 3.11+
  • Ollama running on each device
  • Multi-device setups work automatically — the node agent starts a LAN proxy if Ollama is only listening on localhost

License

MIT
