ollama-orchestra
Production helpers for running Ollama under concurrent load across local and remote endpoints.
Ollama is excellent for local models, but production pipelines quickly hit coordination problems: one GPU should usually receive one request at a time, multi-GPU ingestion needs endpoint rotation, embedding endpoints need fallback, and reasoning models may burn their token budget before producing visible content.
ollama-orchestra packages those patterns into small async utilities.
Install
uv add ollama-orchestra
Concurrency control
from ollama_orchestra import OllamaSemaphorePool, RoundRobinOllama
pool = OllamaSemaphorePool(local_hosts={"gpu-a.local", "gpu-b.local"})
rr = RoundRobinOllama(["http://gpu-a.local:11434", "http://gpu-b.local:11434"])
url = await rr.next_url()
async with pool.semaphore(url):
    # Call your Ollama client here. Local Ollama endpoints default to 1 slot.
    ...
URLs on port 11434 are treated as local Ollama endpoints by default and get one slot. Other URLs default to higher concurrency, which suits OpenAI-compatible gateways and cloud APIs.
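The rotation and one-slot-per-local-endpoint policy described above can be sketched with plain asyncio. This is not the ollama-orchestra implementation: `SimplePool`, `slots_for`, and the remote limit of 8 are hypothetical names and values chosen for illustration.

```python
import asyncio
from itertools import cycle
from urllib.parse import urlparse


class SimplePool:
    """Hypothetical sketch: one asyncio.Semaphore per endpoint URL."""

    def __init__(self, local_slots: int = 1, remote_slots: int = 8) -> None:
        self.local_slots = local_slots
        self.remote_slots = remote_slots  # assumed value, not the library default
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def slots_for(self, url: str) -> int:
        # Treat port 11434 as a local Ollama endpoint with limited slots.
        return self.local_slots if urlparse(url).port == 11434 else self.remote_slots

    def semaphore(self, url: str) -> asyncio.Semaphore:
        if url not in self._semaphores:
            self._semaphores[url] = asyncio.Semaphore(self.slots_for(url))
        return self._semaphores[url]


async def demo() -> dict[str, int]:
    pool = SimplePool()
    # Simple round-robin rotation over two endpoints.
    urls = cycle(["http://gpu-a.local:11434", "http://gpu-b.local:11434"])
    active: dict[str, int] = {}
    peak: dict[str, int] = {}

    async def work(url: str) -> None:
        async with pool.semaphore(url):
            active[url] = active.get(url, 0) + 1
            peak[url] = max(peak.get(url, 0), active[url])
            await asyncio.sleep(0.01)  # stand-in for the real Ollama call
            active[url] -= 1

    await asyncio.gather(*(work(next(urls)) for _ in range(6)))
    return peak


# Peak in-flight requests observed per endpoint.
peak = asyncio.run(demo())
```

Six tasks are spread over two endpoints, and `peak` records how many ran at once on each. With one slot per local endpoint, the semaphore keeps that number at one.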
Reasoning models gotcha
Some Ollama reasoning models can spend the whole num_predict budget inside hidden reasoning and return an empty visible message with done_reason: "length".
Ollama expects think: false at the top level of the request body, not inside options.
from ollama_orchestra import chat
result = await chat(
    "http://localhost:11434",
    "your-model",
    [{"role": "user", "content": "Summarize this log"}],
    think=False,
    num_predict=256,
)
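For readers who call the HTTP API directly, the request body implied by this section might look like the sketch below. `build_chat_payload` is a hypothetical helper, not part of ollama-orchestra. The field placement follows the description in this section: `think` at the top level, `num_predict` inside `options`.

```python
import json


def build_chat_payload(model, messages, think=False, num_predict=256):
    """Hypothetical helper: build a JSON body for Ollama's /api/chat."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "think": think,  # top level, NOT inside "options"
        "options": {"num_predict": num_predict},
    }


payload = build_chat_payload(
    "your-model",
    [{"role": "user", "content": "Summarize this log"}],
)
body = json.dumps(payload)  # string to send as the POST body
```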
The helper also strips leftover <think>, <reasoning>, <thought>, and simple Markdown fences from returned content by default.
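A minimal sketch of this kind of cleanup, assuming simple regular expressions are enough. The tag names come from the sentence above; `clean_content` and the patterns are illustrative assumptions, not the library's actual code.

```python
import re

TAGS = "think|reasoning|thought"
FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable

# A complete <tag>...</tag> block, including its contents.
_BLOCK = re.compile(rf"<({TAGS})>.*?</\1>", re.DOTALL | re.IGNORECASE)
# A leftover opening or closing tag without a partner.
_STRAY = re.compile(rf"</?(?:{TAGS})>", re.IGNORECASE)
# A single fence wrapping the whole message, with an optional language tag.
_FENCED = re.compile(rf"^{FENCE}[\w-]*\n(.*?)\n?{FENCE}$", re.DOTALL)


def clean_content(text: str) -> str:
    """Remove reasoning blocks, stray tags, and one wrapping fence."""
    text = _BLOCK.sub("", text)
    text = _STRAY.sub("", text).strip()
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text
```

Calling `clean_content("<think>plan</think>Answer")` returns `"Answer"`.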
Embeddings with fallback
from ollama_orchestra import EmbeddingService
service = EmbeddingService(
    model="your-embedding-model",
    urls=["http://gpu-a.local:11434", "http://gpu-b.local:11434"],
)
vector = await service.embed_text("Long text is chunked and mean-pooled automatically.")
await service.close()
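Chunking and mean pooling, as mentioned in the example above, can be sketched in pure Python. The character-based splitting, the chunk size, and the `fake_embed` stand-in are assumptions for illustration. The library's actual chunking strategy is not described here.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average equal-length vectors element by element."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fake_embed(chunk: str) -> list[float]:
    # Stand-in for a real embedding call: [length, number of spaces].
    return [float(len(chunk)), float(chunk.count(" "))]


chunks = chunk_text("ab cd ef", max_chars=4)
pooled = mean_pool([fake_embed(c) for c in chunks])
```

Each chunk is embedded separately and the resulting vectors are averaged, so the final vector has the same dimension regardless of input length.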
Features:
- endpoint fallback
- endpoint scoring based on success, failure, and latency
- per-endpoint circuit breakers
- temporary quarantine for failing endpoints
- optional alert callback
- long-text chunking and mean pooling
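The fallback, circuit-breaker, and quarantine features listed above can be illustrated with a small synchronous sketch. `EndpointFallback`, its thresholds, and the `flaky` function are hypothetical. The real service is async and also scores endpoints by latency, which this sketch omits.

```python
import time


class EndpointFallback:
    """Hypothetical sketch: try endpoints in order, skipping quarantined ones."""

    def __init__(self, urls, max_failures=2, quarantine_seconds=30.0, clock=time.monotonic):
        self.urls = list(urls)
        self.max_failures = max_failures
        self.quarantine_seconds = quarantine_seconds
        self.clock = clock
        self.failures = {url: 0 for url in self.urls}
        self.quarantined_until = {url: 0.0 for url in self.urls}

    def available(self):
        now = self.clock()
        return [u for u in self.urls if self.quarantined_until[u] <= now]

    def call(self, fn):
        """Run fn(url) on the first available endpoint that succeeds."""
        last_error = None
        for url in self.available():
            try:
                result = fn(url)
            except Exception as exc:  # broad on purpose for the sketch
                last_error = exc
                self.failures[url] += 1
                if self.failures[url] >= self.max_failures:
                    # Circuit opens: skip this endpoint for a while.
                    self.quarantined_until[url] = self.clock() + self.quarantine_seconds
                    self.failures[url] = 0
                continue
            self.failures[url] = 0
            return result
        raise RuntimeError("all endpoints failed or are quarantined") from last_error


def flaky(url):
    # Simulated backend: gpu-a always fails, gpu-b always answers.
    if "gpu-a" in url:
        raise ConnectionError("gpu-a is down")
    return f"ok from {url}"


fallback = EndpointFallback(["http://gpu-a.local:11434", "http://gpu-b.local:11434"])
first = fallback.call(flaky)   # gpu-a fails once, gpu-b answers
second = fallback.call(flaky)  # gpu-a fails again and is quarantined
```

After the second call, `fallback.available()` lists only the gpu-b endpoint until the quarantine window passes.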
Health and prewarm
from ollama_orchestra import check_server_health, prewarm_all_servers
healthy = await check_server_health("http://localhost:11434")
status = await prewarm_all_servers(["http://localhost:11434"], model="your-model")
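Checking several servers at once is a natural fit for `asyncio.gather`. The sketch below assumes a caller-supplied `probe` coroutine and uses a fake one so it runs without a server. `check_all` and `fake_probe` are hypothetical names, and the real helpers may return a different shape.

```python
import asyncio


async def check_all(urls, probe, timeout=5.0):
    """Run probe(url) concurrently; a timeout or exception counts as unhealthy."""

    async def one(url):
        try:
            return bool(await asyncio.wait_for(probe(url), timeout))
        except Exception:
            return False

    results = await asyncio.gather(*(one(u) for u in urls))
    return dict(zip(urls, results))


async def fake_probe(url):
    # Stand-in for a real HTTP request to the endpoint.
    if "down" in url:
        raise ConnectionError(url)
    return True


status = asyncio.run(
    check_all(["http://up.local:11434", "http://down.local:11434"], fake_probe)
)
```

One failing endpoint does not abort the others, because each probe catches its own errors before `gather` collects the results.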
Documentation and examples
- docs/reasoning-models.md explains Ollama's top-level think: false gotcha.
- docs/production-patterns.md documents concurrency, round-robin, prewarm, and fallback patterns.
- examples/reasoning_chat.py calls Ollama chat with reasoning disabled.
- examples/multi_endpoint_embeddings.py demonstrates embedding fallback across endpoints.
- examples/semaphore_pool.py demonstrates per-endpoint concurrency control.
Roadmap
- Adaptive concurrency based on latency and endpoint health.
- Streaming chat helper.
- Additional gateway-compatible health checks.
Metrics hooks
Both semaphore and embedding workflows accept optional callbacks for lightweight instrumentation:
events = []
pool = OllamaSemaphorePool(metrics_cb=events.append)
service = EmbeddingService("your-embedding-model", ["http://localhost:11434"], metrics_cb=events.append)
Events are dictionaries with an event key, such as semaphore_acquired, embedding_failure, or embedding_endpoint_quarantined.
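Because each event is a plain dictionary, a list-backed callback can be summarized with the standard library. The sample events below are hypothetical. Only the event key is described above, and the url field is an assumption.

```python
from collections import Counter

events = []
metrics_cb = events.append  # same callback shape as in the snippet above

# Hypothetical events; fields other than "event" are assumed for illustration.
metrics_cb({"event": "semaphore_acquired", "url": "http://localhost:11434"})
metrics_cb({"event": "embedding_failure", "url": "http://localhost:11434"})
metrics_cb({"event": "semaphore_acquired", "url": "http://localhost:11434"})

# Count how often each event name occurred.
counts = Counter(e["event"] for e in events)
```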
Development
uv sync --dev
uv run ruff check .
uv run pytest
uv run python scripts/smoke.py
uv build
License
MIT