Smart multimodal router — LLM inference, image generation, speech-to-text, and embeddings across your device fleet
Project description
Ollama Herd
Smart inference router that herds your Ollama instances into one endpoint. Auto-discovers nodes via mDNS, scores them on 7 signals (thermal state, memory fit, queue depth, latency history, role affinity, availability trend, context fit), and routes each request to the optimal device. OpenAI-compatible API with real-time dashboard.
Why
You have multiple machines with GPUs sitting around. You want one endpoint that makes them act like one system — picking the right device for each request automatically, without manual load balancing or config files.
Quick start
pip install ollama-herd
On your router machine:
herd
On each device running Ollama:
herd-node
That's it. The node discovers the router via mDNS and starts sending heartbeats. No config files needed.
To skip mDNS and connect directly:
herd-node --router-url http://router-ip:11435
Usage
Already using Ollama or the OpenAI SDK? Just swap your base URL to the router. No code changes needed — same model names, same API, same streaming. The router handles picking the best machine.
Point any OpenAI-compatible client at the router:
from openai import OpenAI
client = OpenAI(base_url="http://router-ip:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
model="llama3.2:3b",
messages=[{"role": "user", "content": "Hello!"}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content, end="")
Or use the Ollama API directly:
curl http://router-ip:11435/api/chat -d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Both formats support streaming and non-streaming. Responses include real token usage counts.
Model Fallbacks
Specify backup models in case the primary isn't available:
curl http://router-ip:11435/v1/chat/completions -d '{
"model": "llama3.3:70b",
"fallback_models": ["qwen2.5:32b", "qwen2.5:7b"],
"messages": [{"role": "user", "content": "Hello!"}]
}'
The router tries each model in order. If one is unavailable, it seamlessly falls back to the next. See Operations Guide.
Request Tagging
Tag requests to track performance and usage per application, team, or environment:
curl http://router-ip:11435/v1/chat/completions -d '{
"model": "llama3.2:3b",
"metadata": {"tags": ["my-app", "production"]},
"messages": [{"role": "user", "content": "Hello!"}]
}'
Or via header (useful for proxies and middleware):
curl -H "X-Herd-Tags: my-app, production" \
http://router-ip:11435/v1/chat/completions -d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Tags appear in the Apps dashboard tab with per-tag latency, tokens, error rates, and daily trends. See Request Tagging for the full guide including competitive landscape and framework integration examples.
Switching your AI agent to Herd
Copy-paste this to any AI coding agent to have it update your Ollama configuration:
Update the Ollama configuration to use Ollama Herd as the inference router. Ollama Herd is a smart inference router that sits in front of multiple Ollama instances across a LAN, auto-discovers nodes via mDNS, and routes each request to the optimal device based on thermal state, memory fit, queue depth, latency history, and role affinity. Comment out the current direct Ollama base URL (e.g.,
http://localhost:11434) but keep it as a comment for reference. Replace it with the Ollama Herd router URL:http://<router-ip>:11435. For OpenAI-compatible clients, usehttp://<router-ip>:11435/v1as the base URL. The API key can be any non-empty string (e.g.,"not-needed"). The API is fully compatible with both OpenAI and Ollama formats — same model names, same endpoints, same streaming. No other code changes are needed.
Tagging requests for per-project analytics:
Tag all requests to Ollama Herd so we can track usage per project and process. Add a
metadatafield with atagsarray to every request body. Use two tags: one for the project name and one for the script or process making the request. For example:"metadata": {"tags": ["my-project", "code-review"]}. If you're using the OpenAI SDK, pass it viaextra_body:client.chat.completions.create(..., extra_body={"metadata": {"tags": ["my-project", "code-review"]}}). If you can't modify the request body (e.g., reverse proxy or middleware), use theX-Herd-Tagsheader instead:X-Herd-Tags: my-project, code-review. Tags appear in the Herd dashboard under the Apps tab with per-tag latency, token counts, error rates, and daily trends. Keep tag names short, lowercase, and hyphenated.
Beyond LLMs — image generation, speech-to-text, embeddings
The same router handles four model types. Install the backend on any node and it's automatically detected:
Image generation
# Install backends (any combination — install what you need)
uv tool install mflux # Flux models (fastest: ~7s at 512px)
uv tool install diffusionkit # Stable Diffusion 3/3.5 (~9s at 512px)
ollama pull x/z-image-turbo # Ollama native (experimental)
# macOS 26 users: DiffusionKit needs a one-time patch
./scripts/patch-diffusionkit-macos26.sh
Generate an image through the fleet:
curl -o sunset.png http://router-ip:11435/api/generate-image \
-H "Content-Type: application/json" \
-d '{"model": "z-image-turbo", "prompt": "a sunset over mountains", "width": 1024, "height": 1024}'
Available models: z-image-turbo, flux-dev, flux-schnell (mflux), sd3-medium, sd3.5-large (DiffusionKit), x/z-image-turbo, x/flux2-klein (Ollama native). See Image Generation Guide.
Speech-to-text
curl http://router-ip:11435/api/transcribe -F "file=@meeting.wav" -F "model=qwen3-asr"
Embeddings
curl http://router-ip:11435/api/embed \
-d '{"model": "nomic-embed-text", "input": "your text here"}'
How routing works
Every request goes through a scoring pipeline that picks the best device in real time:
- Elimination — offline nodes, missing models, insufficient memory, and critical memory pressure are filtered out
- Thermal state (+50 pts) — models already loaded in GPU memory ("hot") score highest; recently unloaded ("warm") get a partial bonus
- Memory fit (+20 pts) — nodes with more available headroom score higher
- Queue depth (−30 pts) — busy nodes get penalized (capped so no node is starved)
- Latency history (−25 pts) — past p75 latency from SQLite informs expected wait time
- Role affinity (+15 pts) — large models prefer big machines, small models prefer small ones
- Context fit (+15 pts) — nodes with loaded context windows that fit the request's estimated token count score higher
The highest-scoring node wins. If no node is available, the request enters a holding queue and retries until one frees up or times out.
For full details on the scoring algorithm, pre-warm triggers, and rebalancer: Fleet Manager Routing Engine.
Resilience
- Auto-retry — if a node fails before the first response chunk, the router re-scores and retries on the next-best node (up to 2 retries)
- Model fallbacks — clients specify backup models; the router tries alternatives when the primary model has no available nodes
- Context protection — strips
num_ctxfrom requests when unnecessary (prevents Ollama from reloading 89GB models); auto-upgrades to a larger loaded model when more context is genuinely needed - VRAM-aware fallback — routes to an already-loaded model in the same category instead of cold-loading the requested model
- Holding queue — requests wait (up to 30s) when all nodes are busy rather than immediately failing
- Graceful drain — when a node shuts down, in-flight requests finish and pending requests are redistributed
- Zombie reaper — background task detects and cleans up stuck in-flight requests that would otherwise permanently consume queue slots
See Operations Guide for details.
Adaptive Capacity Learning
Laptops aren't servers — their owners use them for meetings, coding, and browsing. The adaptive capacity system learns when each device has spare compute:
- 168-slot behavioral model — learns your weekly usage patterns (7 days × 24 hours)
- Meeting detection — camera/mic active → hard pause (macOS)
- App fingerprinting — classifies workload intensity from resource signatures, privacy-first (no app name reading)
- Dynamic memory ceiling — availability score maps to how much RAM the router can use for Ollama
Enable with FLEET_NODE_ENABLE_CAPACITY_LEARNING=true. See Adaptive Capacity Learning.
Dashboard
The built-in dashboard at /dashboard provides eight views:
- Fleet Overview — live node status, CPU/memory metrics, loaded models, and request queue depths via Server-Sent Events
- Trends — historical charts for requests per hour, average latency, and token throughput (prompt + completion) with selectable time ranges (24h–7d)
- Model Insights — per-model comparison of latency, tokens/sec, and usage; token distribution doughnut chart; clickable rows for daily breakdown
- Apps — per-tag analytics with request volume, latency, tokens, error rates, and daily trends; tag your requests to see per-application breakdowns
- Benchmarks — capacity growth over time with per-run throughput, latency percentiles, per-model and per-node breakdowns
- Health — fleet health analysis with 11 automated checks (offline nodes, memory pressure, thrashing, timeouts, error rates, version mismatch, context protection, zombie reaper)
- Recommendations — AI-powered model mix recommendations per node based on hardware, usage patterns, and curated benchmark data; select which models to pull and download them directly from the dashboard
- Settings — runtime toggle switches for auto-pull and VRAM fallback, read-only config tables grouped by category, and node list with version tracking and Router badge
All powered by Chart.js and a SQLite-backed latency store. No external database required.
Observability
- Per-request traces — every routing decision is recorded with scores, node selection, latency, tokens, tags, retry/fallback status
- Per-app analytics — tag requests with
metadata.tagsorX-Herd-Tagsheader for per-application breakdowns - Usage stats — per-node, per-model, per-day aggregates via
/dashboard/api/usage - JSONL structured logging — daily rotation to
~/.fleet-manager/logs/herd.jsonl, 30-day retention
See Operations Guide for log queries, trace access, and debugging.
API endpoints
| Endpoint | Description |
|---|---|
POST /v1/chat/completions |
OpenAI-compatible chat (streaming + non-streaming) |
GET /v1/models |
List all models across the herd |
POST /api/chat |
Ollama-compatible chat |
POST /api/generate |
Ollama-compatible generate |
GET /api/tags |
Ollama-compatible model list |
GET /api/ps |
Running models across all nodes |
GET /fleet/status |
Herd state: nodes, queues, metrics |
GET /dashboard |
Real-time web dashboard |
GET /dashboard/events |
SSE stream for live fleet updates |
GET /dashboard/api/trends |
Hourly aggregated stats (JSON) |
GET /dashboard/api/models |
Per-model daily stats (JSON) |
GET /dashboard/api/overview |
Summary totals (JSON) |
GET /dashboard/api/usage |
Per-node per-model usage (JSON) |
GET /dashboard/api/apps |
Per-tag aggregated stats (JSON) |
GET /dashboard/api/apps/daily |
Per-tag daily breakdown (JSON) |
GET /dashboard/api/traces |
Recent request traces (JSON) |
GET /dashboard/api/benchmarks |
Benchmark run history (JSON) |
POST /dashboard/api/benchmarks |
Save benchmark results (JSON) |
GET /dashboard/api/health |
Fleet health analysis (JSON) |
GET /dashboard/api/recommendations |
Model mix recommendations per node (JSON, cached 5m) |
POST /dashboard/api/pull |
Pull a model onto a specific node |
GET /dashboard/api/model-management |
Per-node model details with sizes, usage stats, last-used timestamps |
POST /dashboard/api/delete |
Delete a model from a specific node |
GET /dashboard/benchmarks |
Benchmarks dashboard page |
GET /dashboard/health |
Health dashboard page |
GET /dashboard/recommendations |
Model recommendations dashboard page |
GET /dashboard/settings |
Settings dashboard page |
GET /dashboard/api/settings |
Current config, toggles, and node list (JSON) |
POST /dashboard/api/settings |
Toggle runtime-mutable settings (auto_pull, vram_fallback) |
Full request/response schemas: API Reference.
Agent Framework Integration
Every major agent framework supports custom base_url — point it at Herd and your agents run across your entire device fleet:
# LangChain
llm = ChatOpenAI(base_url="http://router-ip:11435/v1", model="llama3.3:70b", api_key="none")
# CrewAI
llm = LLM(model="ollama/llama3.3:70b", base_url="http://router-ip:11435")
# OpenHands
export LLM_BASE_URL=http://router-ip:11435/v1
Compatible with: OpenClaw, LangChain, CrewAI, AutoGen, LlamaIndex, Haystack, smolagents, OpenHands, Aider, Cline, Continue.dev, Bolt.diy, and any OpenAI-compatible client.
See OpenClaw Integration Guide for the full compatibility matrix.
Design Philosophy
Six principles shape every decision in this project:
-
Every node stands alone — Each device is sovereign. It runs its own Ollama, manages its own models, learns its own capacity patterns, and works fine without the router. The router coordinates but never controls. No central config file. No dependency chains. A node that loses connectivity keeps serving local inference.
-
Two-person scale — Two CLI commands, zero config files, zero Docker. If it requires a manual, it's too complex. Every architectural choice picks the simple thing (HTTP heartbeats over gRPC, SQLite over Postgres, mDNS over etcd). The whole codebase fits in one person's head.
-
Human-readable state — JSONL logs you can
grep. SQLite you can query with standard tools. JSON config on disk. Env vars for settings. No opaque binary formats. If you can't debug it withcatandsqlite3, it's wrong. -
The inference request is primary — Scoring, queuing, retry, fallback, capacity learning, meeting detection — everything exists to serve one thing: get the best response on the best machine as fast as possible. If a feature doesn't serve that, it doesn't belong.
-
AI as resident, not visitor — The system accumulates knowledge over time. The capacity learner builds a 168-slot behavioral model of your week. The latency store remembers which nodes are fast for which models. The trace store records every routing decision. It gets smarter the longer it runs.
-
Shared DNA, not shared code — The scoring pipeline (eliminate → score → rank → select), heartbeat-based coordination, and adaptive capacity learning are transferable patterns, not a framework. Specific tool, transferable DNA.
Architecture
┌─────────────────────────────────────────────────────┐
│ Client (OpenAI SDK, curl, any HTTP client) │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Herd Router (:11435) │
│ ┌────────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ Scoring │ │ Queue │ │ Streaming Proxy │ │
│ │ Engine │ │ Manager │ │ (format convert) │ │
│ └────────────┘ └──────────┘ └───────────────────┘ │
│ ┌────────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ Latency │ │ Rebal- │ │ Dashboard + │ │
│ │ Store │ │ ancer │ │ SSE + Charts │ │
│ └────────────┘ └──────────┘ └───────────────────┘ │
│ ┌────────────┐ ┌──────────┐ │
│ │ Trace │ │ Pre- │ │
│ │ Store │ │ Warm │ │
│ └────────────┘ └──────────┘ │
└──────────┬──────────────────────────┬───────────────┘
│ heartbeats │ inference
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Herd Node A │ │ Herd Node B │
│ (agent + Ollama)│ │ (agent + Ollama)│
│ ┌────────────┐ │ │ ┌────────────┐ │
│ │ Capacity │ │ │ │ LAN Proxy │ │
│ │ Learner │ │ │ │ (auto TCP) │ │
│ └────────────┘ │ └──└────────────┘──┘
└──────────────────┘
Two CLI entry points, one Python package:
herd— FastAPI server with scoring, queues, streaming proxy, trace store, and dashboardherd-node— lightweight agent that collects system metrics, sends heartbeats, and optionally learns capacity patterns
Optimize Ollama for your hardware
Ollama's defaults are conservative. On machines with lots of memory, you're probably leaving performance on the table. These settings tell Ollama to actually use the hardware you paid for:
# Keep models loaded permanently (default: 5m — unloads after 5 minutes of idle!)
# On a 512GB Mac Studio, there's zero reason to unload a model after 5 minutes
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
# Allow multiple models in memory simultaneously (default: auto, but often conservative)
# Set to -1 for unlimited — let Ollama load as many as fit in memory
launchctl setenv OLLAMA_MAX_LOADED_MODELS "-1"
# Restart Ollama app after changing these (⌘Q and reopen)
Herd handles this automatically for routed requests — every request proxied through the router includes keep_alive: -1, so models loaded via Herd stay loaded regardless of Ollama's server-side default. But you should still set the env var to cover models loaded directly (e.g., ollama run) and to prevent Ollama from evicting idle models between requests.
| Setting | Default | Recommended | Why |
|---|---|---|---|
OLLAMA_KEEP_ALIVE |
5m |
-1 (forever) |
Don't unload models from memory when you have RAM to spare |
OLLAMA_MAX_LOADED_MODELS |
auto | -1 (unlimited) |
Let multiple models stay hot simultaneously |
OLLAMA_NUM_PARALLEL |
auto | 2–4 for multi-model fleets |
Auto-calculated value can be very high on large-memory machines (e.g., 16), causing massive KV cache allocation per model — see warning below |
Warning:
OLLAMA_NUM_PARALLELand KV cache bloat. On high-memory machines, Ollama auto-calculates a high parallel slot count (e.g., 16). Each slot pre-allocates KV cache for the full context window. With 16 slots × 262K context, a single model can consume 384 GB of KV cache on top of its weights — leaving no room for other models and causing constant eviction thrashing. If you run multiple models, setOLLAMA_NUM_PARALLELto2–4:launchctl setenv OLLAMA_NUM_PARALLEL 2 # 2 parallel slots × 262K ctx ≈ 20 GB KV cache per modelThis lets multiple models coexist in memory instead of one model monopolizing all VRAM.
Quick check — run ollama ps and look at the "Until" column:
NAME SIZE UNTIL
gpt-oss:120b 88 GB Forever ← good: model stays loaded
qwen3.5:122b 87 GB Forever ← good: both hot, no thrashing
If you see a timestamp instead of "Forever", your keep-alive is too short.
macOS note:
launchctl setenvsets the variable for the GUI session. Forollama servefrom the terminal, useexport OLLAMA_KEEP_ALIVE=-1instead. On Linux, add it to your systemd service file or shell profile.
Configuration
All settings via environment variables. See Configuration Reference for the complete list of 31+ variables with tuning guidance.
Common variables
| Variable | Default | Description |
|---|---|---|
FLEET_PORT |
11435 |
Router listen port |
FLEET_HOST |
0.0.0.0 |
Router bind address |
FLEET_HEARTBEAT_INTERVAL |
5.0 |
Heartbeat check interval (seconds) |
FLEET_HEARTBEAT_TIMEOUT |
15.0 |
Mark node degraded after (seconds) |
FLEET_HEARTBEAT_OFFLINE |
30.0 |
Mark node offline after (seconds) |
FLEET_MAX_RETRIES |
2 |
Auto-retry attempts on node failure |
FLEET_LOG_LEVEL |
DEBUG |
JSONL log file level |
Node settings use the FLEET_NODE_ prefix:
| Variable | Default | Description |
|---|---|---|
FLEET_NODE_OLLAMA_HOST |
http://localhost:11434 |
Local Ollama URL |
FLEET_NODE_ROUTER_URL |
(auto-discover) | Router URL (skips mDNS) |
FLEET_NODE_ENABLE_CAPACITY_LEARNING |
false |
Enable adaptive capacity system |
Development
git clone https://github.com/geeks-accelerator/ollama-herd.git
cd ollama-herd
uv sync # install deps
uv run herd # start router
uv run herd-node # start node agent
uv run pytest -v # run all 412 tests (~5s)
uv run ruff check src/ # lint
uv run ruff format src/ # format
Documentation
| Document | Description |
|---|---|
| API Reference | All endpoints with request/response schemas |
| Configuration Reference | All 29+ environment variables with tuning guidance |
| Operations Guide | Logging, traces, fallbacks, retry, drain, pre-warm, streaming |
| Adaptive Capacity | Capacity learner, meeting detection, app fingerprinting |
| Routing Engine | 5-stage scoring pipeline deep dive |
| OpenClaw Integration | Setup guide for OpenClaw agents |
| Request Tagging | Per-app analytics, tagging strategies, competitive landscape |
| Troubleshooting | Common issues, LAN debugging, operational gotchas |
| Architecture Decisions | Port selection, design trade-offs, rationale |
What's Next
The fleet is smart but passive — it waits for requests. The next evolution is an agentic router that uses idle compute proactively:
- Task backlogs — drop tasks throughout the day, the fleet chews through them when idle
- Pattern-driven pre-warming — the capacity learner already knows your weekly rhythm, the router should act on it
- Agentic decomposition — complex tasks broken into subtask DAGs, executed in parallel across the fleet
- Fleet health opinions — the router surfaces observations, not just metrics
"Your fleet doesn't just wait for requests — it works for you while you sleep."
Scale your AI agent's brain
Running an AI coding agent like OpenClaw, Aider, or Continue.dev with a local Ollama? You're limited to one machine's GPU. Ollama Herd turns every device on your network into extra capacity — your laptop, your desktop, that Mac Mini in the closet.
- Install Ollama on each device and pull the models you want
- Run
herd-nodeon each device (one command, zero config) - Run
herdon any machine to start the router - Point your agent at
http://router-ip:11435/v1instead ofhttp://localhost:11434
Your agent doesn't know or care that multiple machines are behind the endpoint. It sees one API with the same models, same streaming, same formats. The router picks the best device for each request — the one with the model already loaded, the most free memory, the lowest queue depth. When one machine is busy in a meeting, requests flow to the others automatically.
This is especially powerful for agentic workflows that fire many parallel requests — code review, test generation, documentation — the fleet absorbs the burst across all available GPUs instead of queuing everything on one.
Contributing
Whether you're carbon-based or silicon-based, contributions are welcome. This project is built by humans and AI agents working together — every commit, every observation, every pattern.
For humans: Fork it, run the tests (uv run pytest), make your change, open a PR. The codebase is designed to fit in one person's head. Start with Architecture Decisions to understand why things are the way they are.
For AI agents: Read CLAUDE.md first — it's your onboarding doc. The project uses docs/issues.md to track what's broken and docs/observations.md to accumulate what we've learned. After making a significant change, check if your work produced a new observation or revealed a new issue, and append it. That's how the project gets smarter across sessions.
Good first contributions:
- Pick an open issue from
docs/issues.mdand fix it - Add test coverage for an untested module (see issue #10)
- Run the fleet and add an observation to
docs/observations.md - Integrate with a new agent framework and document it
⭐ If Ollama Herd is useful to you, star the repo — it helps others discover the project and keeps the herd growing.
Requirements
- Python 3.11+
- Ollama running on each device
- Multi-device setups work automatically — the node agent starts a LAN proxy if Ollama is only listening on localhost
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ollama_herd-0.3.0.tar.gz.
File metadata
- Download URL: ollama_herd-0.3.0.tar.gz
- Upload date:
- Size: 363.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
39cc4a11cb53d140ee4f40cc7723c730e3f2ba54a04ce1884e985f4205970461
|
|
| MD5 |
a2bbd8e894463782c84b15ade32fa06d
|
|
| BLAKE2b-256 |
d9d9cfa5977c575444dad99afb7d0e21a40cfea62fc9511fe5808735f317f5b9
|
File details
Details for the file ollama_herd-0.3.0-py3-none-any.whl.
File metadata
- Download URL: ollama_herd-0.3.0-py3-none-any.whl
- Upload date:
- Size: 135.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9ad8079166d2972768fceef76696831e840822be68528280ee1a7fd3eead3d97
|
|
| MD5 |
fbc7c0e7ddb3d84d86f2ef621e12ba0a
|
|
| BLAKE2b-256 |
2beee903cad958bf5cabb9671af044e3dd708e60ec2e2218bcb11a4233ab55e5
|