# Ollama Herd

Smart multimodal router — LLM inference, image generation, speech-to-text, and embeddings across your device fleet. Cross-platform: macOS, Linux, Windows.
Turn all your devices into one local AI cluster. Ollama Herd is a smart inference router and load balancer that auto-discovers Ollama nodes via mDNS and routes LLM, image-generation, speech-to-text, and embedding requests to the optimal device using intelligent scoring. OpenAI-compatible API. Zero config. Zero cost.
## Why Ollama Herd?
- Your spare Mac is wasting compute — pool all your devices into one fleet
- A single Ollama instance bottlenecks agents — distribute requests across machines automatically
- Cloud APIs cost $450-1,800/month at fleet scale — local inference is zero marginal cost
- No config files, no Docker, no Kubernetes — two commands, mDNS auto-discovery
- Not just LLMs — routes image generation (FLUX), speech-to-text (Qwen3-ASR), and embeddings too
- The fleet gets smarter over time — capacity learning, thermal awareness, meeting detection
## Quick Start

```bash
pip install ollama-herd
```

Or with Homebrew (macOS/Linux):

```bash
brew tap geeks-accelerator/ollama-herd
brew install ollama-herd
```

On your router machine:

```bash
herd
```

On each device running Ollama:

```bash
herd-node
```

That's it. The node discovers the router via mDNS and starts sending heartbeats. No config files needed.

To skip mDNS and connect directly:

```bash
herd-node --router-url http://router-ip:11435
```
## Features
| Feature | Description |
|---|---|
| Smart Scoring | Routes to the best device based on thermal state, memory fit, queue depth, latency, affinity, availability, and context fit |
| Zero-Config Discovery | mDNS auto-discovery — no IPs, no config files, no manual setup |
| Multimodal Routing | LLMs, vision (gemma3, llava, llama3.2-vision), embeddings, image gen (FLUX via mflux/DiffusionKit), speech-to-text (Qwen3-ASR) |
| Live Dashboard | Fleet overview, trends, model insights, per-app analytics, benchmarks, health, recommendations, settings |
| Capacity Learning | 168-slot weekly behavioral model per device — learns when your machines are available |
| Auto-Retry & Fallbacks | Transparent retry on failure + client-specified backup models |
| Thinking Model Support | Auto-detects DeepSeek-R1, QwQ, phi-4-reasoning and inflates token budgets to prevent empty responses |
| Smart Benchmarks | Auto-discovers fleet, benchmarks all 5 model types, tracks performance over time |
| Dynamic Context | Measures actual token usage, auto-adjusts context windows to free KV cache memory |
| Fleet Intelligence | AI-generated fleet briefings with health summaries, trend analysis, and actionable recommendations |
| Health Engine | 18 automated checks: memory, thermal, context waste, thrashing, timeouts, errors, zombies, priority models, and more |
| Request Tagging | Per-app analytics via tags — track usage, latency, and errors per application or team |
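
The Smart Scoring row above lists seven signals: thermal state, memory fit, queue depth, latency, affinity, availability, and context fit. As a minimal sketch of how a multi-signal scorer can combine them (the weights, field names, and normalization below are illustrative assumptions, not Ollama Herd's actual formula):

```python
from dataclasses import dataclass

@dataclass
class NodeSnapshot:
    """Hypothetical per-node metrics from a heartbeat (illustrative only)."""
    thermal_headroom: float   # 0.0 (throttling) .. 1.0 (cool)
    memory_fit: float         # 1.0 if the model fits in free RAM, else lower
    queue_depth: int          # requests currently waiting on this node
    latency_ms: float         # recent round-trip latency to the node
    has_model_loaded: bool    # affinity: model already hot on this node
    available: bool           # capacity learner says the machine is free
    context_fit: float        # 1.0 if the request's context fits the KV cache

def score(node: NodeSnapshot) -> float:
    """Higher is better; the router picks the max-scoring node."""
    if not node.available:
        return float("-inf")  # never route to an unavailable node
    return (
        2.0 * node.memory_fit
        + 1.5 * (1.0 if node.has_model_loaded else 0.0)  # avoid cold loads
        + 1.0 * node.thermal_headroom
        + 1.0 * node.context_fit
        - 0.5 * node.queue_depth
        - 0.001 * node.latency_ms
    )

# Two example nodes: A has the model hot and an empty queue, B is busier.
node_a = NodeSnapshot(0.9, 1.0, 0, 4.0, True, True, 1.0)
node_b = NodeSnapshot(0.4, 0.8, 3, 12.0, False, True, 1.0)
best = max([node_a, node_b], key=score)
print("route to:", "A" if best is node_a else "B")
```

The Routing Engine doc (linked under Documentation below) covers the real scoring pipeline.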
## Usage

Point any OpenAI-compatible client at the router:

```python
from openai import OpenAI

client = OpenAI(base_url="http://router-ip:11435/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    # delta.content is None on the final streamed chunk, so guard against it
    print(chunk.choices[0].delta.content or "", end="")
```
Or use the Ollama API directly:

```bash
curl http://router-ip:11435/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
```
### Model Fallbacks

```bash
curl http://router-ip:11435/v1/chat/completions -d '{
  "model": "llama3.3:70b",
  "fallback_models": ["qwen2.5:32b", "qwen2.5:7b"],
  "messages": [{"role": "user", "content": "Hello!"}]
}'
```

The router tries each model in order, falling back seamlessly if one is unavailable.
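
Since `fallback_models` is a router extension rather than a standard OpenAI field, OpenAI SDK users can pass it through `extra_body`, which merges arbitrary keys into the request JSON. A sketch, reusing the models from the curl example above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://router-ip:11435/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello!"}],
    # extra_body merges these fields into the request body, so the router
    # sees fallback_models alongside the standard OpenAI fields.
    extra_body={"fallback_models": ["qwen2.5:32b", "qwen2.5:7b"]},
)
print(response.choices[0].message.content)
```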
## Beyond LLMs
The same router handles five model types — install a backend on any node and it's automatically detected.
### Vision (Image Understanding)

```python
from openai import OpenAI

client = OpenAI(base_url="http://router-ip:11435/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gemma3:27b",  # or llama3.2-vision, llava, moondream
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
```
Works with any Ollama vision model. Both OpenAI and Ollama formats supported — the router auto-converts.
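
For reference, Ollama's native chat format carries base64 images in an `images` list on the message rather than an `image_url` content part. A sketch using `requests`, assuming the router forwards the native `/api/chat` schema as in the earlier curl example (`photo.png` is a placeholder path):

```python
import base64

import requests

with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Ollama's native format: plain string content plus an "images" list.
resp = requests.post(
    "http://router-ip:11435/api/chat",
    json={
        "model": "gemma3:27b",
        "messages": [
            {
                "role": "user",
                "content": "What's in this image?",
                "images": [image_b64],
            }
        ],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```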
### Image Generation

```bash
# Install a backend (any node)
uv tool install mflux

# Generate
curl -o sunset.png http://router-ip:11435/api/generate-image \
  -d '{"model": "z-image-turbo", "prompt": "a sunset over mountains", "width": 1024, "height": 1024}'
```

Supports mflux (FLUX), DiffusionKit (Stable Diffusion 3/3.5), and Ollama native models. See Image Generation Guide.
### Speech-to-Text

```bash
# Install backend (any node)
pip install 'mlx-qwen3-asr[serve]'

# Transcribe
curl http://router-ip:11435/api/transcribe -F "file=@meeting.wav" -F "model=qwen3-asr"
```
### Embeddings

```bash
curl http://router-ip:11435/api/embed \
  -d '{"model": "nomic-embed-text", "input": ["first document", "second document"]}'
```
Works with any Ollama embedding model: nomic-embed-text, mxbai-embed-large, all-minilm, snowflake-arctic-embed.
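
As a quick sanity check, the `/api/embed` response carries one vector per input string under `embeddings`, which is enough to compare two documents by cosine similarity. A sketch using only `requests` and the standard library:

```python
import math

import requests

resp = requests.post(
    "http://router-ip:11435/api/embed",
    json={"model": "nomic-embed-text",
          "input": ["first document", "second document"]},
    timeout=60,
)
a, b = resp.json()["embeddings"]  # one vector per input string

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(f"similarity: {cosine(a, b):.3f}")
```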
## Works With
Ollama Herd is a drop-in replacement — just change the base URL:
| Framework | Integration |
|---|---|
| Open WebUI | Set Ollama URL to `http://router-ip:11435` in admin settings |
| LangChain | `ChatOpenAI(base_url="http://router-ip:11435/v1")` |
| CrewAI | `LLM(base_url="http://router-ip:11435")` |
| Aider | `--openai-api-base http://router-ip:11435/v1` |
| Continue.dev | Set `apiBase` in `config.json` |
| OpenHands | `LLM_BASE_URL=http://router-ip:11435/v1` |
| OpenClaw | See OpenClaw Integration Guide |
| Any OpenAI client | Change `base_url` to `http://router-ip:11435/v1` |
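
As one worked example, the LangChain row above expands to a runnable snippet like this (assumes the `langchain-openai` package is installed; the model name is just an example):

```python
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the router instead of api.openai.com
llm = ChatOpenAI(
    base_url="http://router-ip:11435/v1",
    api_key="not-needed",
    model="llama3.2:3b",
)
print(llm.invoke("Hello!").content)
```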
## Platform Support
Ollama Herd runs on macOS, Linux, and Windows — anywhere Ollama runs.
| Feature | macOS | Linux | Windows |
|---|---|---|---|
| LLM routing, scoring, queues | Yes | Yes | Yes |
| Embeddings proxy | Yes | Yes | Yes |
| mDNS auto-discovery | Yes | Yes | Yes |
| Dashboard & traces | Yes | Yes | Yes |
| Image gen (mflux, DiffusionKit) | Yes (Apple Silicon) | -- | -- |
| Image gen (Ollama native) | Yes | Yes | Yes |
| Speech-to-text (MLX) | Yes (Apple Silicon) | -- | -- |
| Meeting detection (camera/mic) | Yes | -- | -- |
| Memory pressure detection | Yes | Yes | -- |
Core routing works identically on all platforms. macOS-only features degrade gracefully.
## Architecture

```
┌─────────────────────────────────────────────────────┐
│  Client (OpenAI SDK, curl, any HTTP client)         │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│  Herd Router (:11435)                               │
│ ┌────────────┐ ┌──────────┐ ┌───────────────────┐   │
│ │ Scoring    │ │ Queue    │ │ Streaming Proxy   │   │
│ │ Engine     │ │ Manager  │ │ (format convert)  │   │
│ └────────────┘ └──────────┘ └───────────────────┘   │
│ ┌────────────┐ ┌──────────┐ ┌───────────────────┐   │
│ │ Trace      │ │ Health   │ │ Dashboard +       │   │
│ │ Store      │ │ Engine   │ │ SSE + Charts      │   │
│ └────────────┘ └──────────┘ └───────────────────┘   │
└──────────┬──────────────────────────┬───────────────┘
           │ heartbeats               │ inference
           ▼                          ▼
┌──────────────────┐        ┌──────────────────┐
│ Herd Node A      │        │ Herd Node B      │
│ (agent + Ollama) │        │ (agent + Ollama) │
│ ┌────────────┐   │        │ ┌────────────┐   │
│ │ Capacity   │   │        │ │ LAN Proxy  │   │
│ │ Learner    │   │        │ │ (auto TCP) │   │
│ └────────────┘   │        │ └────────────┘   │
└──────────────────┘        └──────────────────┘
```
Two CLI entry points, one Python package:
- `herd` — FastAPI server with scoring, queues, streaming proxy, trace store, health engine, and dashboard
- `herd-node` — lightweight agent that collects system metrics, sends heartbeats, and optionally learns capacity patterns
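
The 168-slot weekly model mentioned under Features maps naturally onto one slot per hour of the week (24 hours × 7 days = 168). A hypothetical sketch of such a learner, where the slot layout and the smoothing factor are assumptions rather than the actual implementation:

```python
from datetime import datetime

class CapacityLearner:
    """Tracks how often a machine is free in each hour-of-week slot."""

    SLOTS = 24 * 7  # 168 hourly slots spanning one week

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha                       # EMA smoothing factor (assumed)
        self.availability = [0.5] * self.SLOTS   # start every slot undecided

    @staticmethod
    def slot(now: datetime) -> int:
        # Monday 00:00 is slot 0; Sunday 23:00 is slot 167
        return now.weekday() * 24 + now.hour

    def observe(self, now: datetime, available: bool) -> None:
        """Blend the newest observation into this hour-of-week's estimate."""
        i = self.slot(now)
        self.availability[i] += self.alpha * (float(available) - self.availability[i])

    def predict(self, now: datetime) -> float:
        """Estimated probability the machine is free at this hour of the week."""
        return self.availability[self.slot(now)]

learner = CapacityLearner()
learner.observe(datetime.now(), available=True)
print(f"predicted availability: {learner.predict(datetime.now()):.2f}")
```

A low predicted availability for the current slot would then feed the availability signal in the routing score.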
## Documentation
| Document | Description |
|---|---|
| API Reference | All endpoints with request/response schemas |
| Configuration Reference | All 47+ environment variables with tuning guidance |
| Operations Guide | Logging, traces, fallbacks, retry, drain, streaming, context protection |
| Routing Engine | Scoring pipeline deep dive |
| Adaptive Capacity | Capacity learner, meeting detection, app fingerprinting |
| Request Tagging | Per-app analytics and tagging strategies |
| Thinking Models | Chain-of-thought models, budget inflation, diagnostic headers |
| Image Generation | mflux, DiffusionKit, Ollama native setup |
| Troubleshooting | Common issues, LAN debugging, operational gotchas |
| Changelog | What's new in each release |
## Optimize Ollama for Your Hardware
Ollama's defaults are conservative. On machines with lots of memory, set these to actually use the hardware you paid for:
| Setting | Default | Recommended | Why |
|---|---|---|---|
| `OLLAMA_KEEP_ALIVE` | `5m` | `-1` (forever) | Don't unload models from memory when you have RAM to spare |
| `OLLAMA_MAX_LOADED_MODELS` | auto | `-1` (unlimited) | Let multiple models stay hot simultaneously |
| `OLLAMA_NUM_PARALLEL` | auto | `2-4` | Prevents KV cache bloat on high-memory machines |

Set via `launchctl setenv` (macOS), `systemctl edit ollama` (Linux), or system environment variables (Windows). See Configuration Reference for details.
## Development

```bash
git clone https://github.com/geeks-accelerator/ollama-herd.git
cd ollama-herd

uv sync                  # install deps
uv run herd              # start router
uv run herd-node         # start node agent

uv sync --extra dev      # install test deps
uv run pytest            # run all tests (~5s)
uv run ruff check src/   # lint
uv run ruff format src/  # format
```
## Contributing
Whether you're carbon-based or silicon-based, contributions are welcome. This project is built by humans and AI agents working together.
**For humans:** Fork it, run the tests (`uv run pytest`), make your change, open a PR. Start with CONTRIBUTING.md for guidelines and Architecture Decisions for context.

**For AI agents:** Read CLAUDE.md first — it's your onboarding doc. The project uses `docs/issues.md` for bug tracking and `docs/observations.md` for operational learnings.
Good first contributions:
- Pick an open issue from `docs/issues.md`
- Integrate with a new agent framework and document it
- Run the fleet and add an observation to `docs/observations.md`
Questions? Open a Discussion.
If Ollama Herd is useful to you, star the repo — it helps others discover the project and keeps the herd growing.
## Requirements
- Python 3.11+
- Ollama running on each device
- Multi-device setups work automatically — the node agent starts a LAN proxy if Ollama is only listening on localhost
## License
MIT