
Production helpers for running Ollama under concurrent load across local and remote endpoints.


ollama-orchestra


Production helpers for running Ollama under concurrent load.

Ollama is excellent for local models, but production pipelines quickly hit coordination problems: one GPU should usually receive one request at a time, multi-GPU ingestion needs endpoint rotation, embedding endpoints need fallback, and reasoning models may burn their token budget before producing visible content.

ollama-orchestra packages those patterns into small async utilities.

Install

uv add ollama-orchestra

Concurrency control

import asyncio

from ollama_orchestra import OllamaSemaphorePool, RoundRobinOllama


async def main():
    pool = OllamaSemaphorePool(local_hosts={"gpu-a.local", "gpu-b.local"})
    rr = RoundRobinOllama(["http://gpu-a.local:11434", "http://gpu-b.local:11434"])

    url = await rr.next_url()  # rotate to the next endpoint
    async with pool.semaphore(url):
        # Call your Ollama client here. Local Ollama endpoints default to 1 slot.
        ...


asyncio.run(main())

URLs on port 11434 are treated as local Ollama endpoints by default and get one slot each. Other URLs default to higher concurrency, which suits OpenAI-compatible gateways and cloud APIs.
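The same pattern can be approximated with the standard library alone. The sketch below is not the library's implementation: SketchPool, RoundRobin, and the remote default of 8 slots are hypothetical names and values chosen for illustration. It keeps one asyncio.Semaphore per URL, gives port-11434 URLs a single slot, and rotates endpoints with itertools.cycle.

```python
import asyncio
import itertools
from urllib.parse import urlsplit


class SketchPool:
    """Hypothetical per-endpoint semaphore pool (not ollama-orchestra's code)."""

    def __init__(self, local_slots: int = 1, remote_slots: int = 8) -> None:
        self._local_slots = local_slots
        self._remote_slots = remote_slots  # assumed default for gateways
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def slots_for(self, url: str) -> int:
        # Assumption: port 11434 marks a local Ollama endpoint.
        return self._local_slots if urlsplit(url).port == 11434 else self._remote_slots

    def semaphore(self, url: str) -> asyncio.Semaphore:
        # Create the semaphore lazily so each URL gets exactly one.
        if url not in self._semaphores:
            self._semaphores[url] = asyncio.Semaphore(self.slots_for(url))
        return self._semaphores[url]


class RoundRobin:
    """Hypothetical endpoint rotation."""

    def __init__(self, urls: list[str]) -> None:
        self._cycle = itertools.cycle(urls)

    def next_url(self) -> str:
        return next(self._cycle)


async def demo() -> int:
    pool = SketchPool()
    url = "http://gpu-a.local:11434"
    active = 0
    peak = 0

    async def job() -> None:
        nonlocal active, peak
        async with pool.semaphore(url):
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for an Ollama request
            active -= 1

    await asyncio.gather(*(job() for _ in range(5)))
    return peak  # highest number of jobs inside the semaphore at once
```

Running asyncio.run(demo()) reports how many jobs held the local endpoint's semaphore at the same time, which makes the one-slot limit easy to check before pointing real traffic at a GPU.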

Reasoning models gotcha

Some Ollama reasoning models can spend the whole num_predict budget inside hidden reasoning and return an empty visible message with done_reason: "length".

Ollama expects think: false at the top level of the request body, not inside options.

import asyncio

from ollama_orchestra import chat


async def main():
    result = await chat(
        "http://localhost:11434",
        "your-model",
        [{"role": "user", "content": "Summarize this log"}],
        think=False,  # disable hidden reasoning
        num_predict=256,
    )


asyncio.run(main())
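For comparison, here is what a raw /api/chat request body with reasoning disabled might look like. build_chat_body is a hypothetical function written for this illustration, not part of ollama-orchestra. It assumes num_predict belongs under options, as in Ollama's API, while think sits next to model and messages.

```python
import json


def build_chat_body(model, messages, think=False, num_predict=256):
    """Build a /api/chat request body with `think` at the top level."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "think": think,  # top level, next to "model" and "messages"
        "options": {"num_predict": num_predict},  # generation limits live here
    }


body = build_chat_body(
    "your-model",
    [{"role": "user", "content": "Summarize this log"}],
)
print(json.dumps(body, indent=2))
```

Moving "think" into "options" is the mistake this section warns about: the reasoning pass then still runs and can consume the whole budget.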

By default, the helper also strips leftover <think>, <reasoning>, and <thought> tags, as well as simple Markdown fences, from the returned content.
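A cleanup step of this kind can be sketched with regular expressions. This is an illustration under assumptions, not the helper's actual code: clean_content is a hypothetical name, and "simple fence" is taken to mean one fence wrapping the entire message.

```python
import re

_TAGS = "think|reasoning|thought"
_FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# Matches <tag>...</tag> blocks, including multi-line bodies.
_BLOCK_RE = re.compile(r"<(%s)>.*?</\1>" % _TAGS, re.DOTALL | re.IGNORECASE)
# Matches stray opening or closing tags left behind by truncated output.
_STRAY_RE = re.compile(r"</?(?:%s)>" % _TAGS, re.IGNORECASE)
# Matches one fence around the whole message, with an optional language tag.
_FENCE_RE = re.compile(r"^%s[\w-]*\n(.*?)\n?%s$" % (_FENCE, _FENCE), re.DOTALL)


def clean_content(text: str) -> str:
    """Remove reasoning tags and an enclosing Markdown fence from model output."""
    text = _BLOCK_RE.sub("", text)
    text = _STRAY_RE.sub("", text).strip()
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text
```

Complete tag blocks are removed before stray tags so that reasoning text inside a closed block is dropped rather than kept as visible content.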

Embeddings with fallback

import asyncio

from ollama_orchestra import EmbeddingService


async def main():
    service = EmbeddingService(
        model="your-embedding-model",
        urls=["http://gpu-a.local:11434", "http://gpu-b.local:11434"],
    )
    try:
        vector = await service.embed_text("Long text is chunked and mean-pooled automatically.")
    finally:
        await service.close()  # release connections even if embedding fails


asyncio.run(main())
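Chunking and mean pooling can be sketched as follows. The fixed-size character chunks and the unweighted average are assumptions made for this example; the library may split on tokens or weight chunks differently. chunk_text, mean_pool, and embed_long_text are hypothetical names, and embed_one stands in for a real call to an embedding endpoint.

```python
from typing import Callable, List, Sequence


def chunk_text(text: str, max_chars: int = 1000) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)] or [""]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors component by component."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_long_text(
    text: str,
    embed_one: Callable[[str], Sequence[float]],
    max_chars: int = 1000,
) -> List[float]:
    # Embed each chunk separately, then collapse the results into one vector.
    return mean_pool([embed_one(chunk) for chunk in chunk_text(text, max_chars)])
```

The pooled vector has the same dimension as a single-chunk embedding, so downstream similarity search does not need to know whether the input was split.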

Features:

  • endpoint fallback
  • endpoint scoring based on success, failure, and latency
  • per-endpoint circuit breakers
  • temporary quarantine for failing endpoints
  • optional alert callback
  • long-text chunking and mean pooling
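The circuit-breaker and quarantine ideas from the list above can be sketched like this. EndpointBreaker and pick_endpoint are hypothetical names, and the thresholds are illustrative values, not the library's defaults. The clock is injectable so the behaviour can be checked without waiting.

```python
import time


class EndpointBreaker:
    """Hypothetical per-endpoint circuit breaker with timed quarantine."""

    def __init__(self, max_failures=3, quarantine_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.quarantine_s = quarantine_s
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0

    def available(self) -> bool:
        # The endpoint is usable once its quarantine window has passed.
        return self._clock() >= self._open_until

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            # Trip the breaker: skip this endpoint for quarantine_s seconds.
            self._open_until = self._clock() + self.quarantine_s
            self._failures = 0


def pick_endpoint(urls, breakers):
    """Return the first URL whose breaker is closed, or None if all are quarantined."""
    for url in urls:
        if breakers[url].available():
            return url
    return None
```

A caller would try pick_endpoint, call record_success or record_failure after each request, and fire an alert callback when pick_endpoint returns None.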

Health and prewarm

import asyncio
from ollama_orchestra import check_server_health, prewarm_all_servers

async def main():
    healthy = await check_server_health("http://localhost:11434")
    status = await prewarm_all_servers(["http://localhost:11434"], model="your-model")

asyncio.run(main())

Documentation and examples

  • docs/reasoning-models.md explains Ollama's top-level think: false gotcha.
  • docs/production-patterns.md documents concurrency, round-robin, prewarm, and fallback patterns.
  • examples/reasoning_chat.py calls Ollama chat with reasoning disabled.
  • examples/multi_endpoint_embeddings.py demonstrates embedding fallback across endpoints.
  • examples/semaphore_pool.py demonstrates per-endpoint concurrency control.

Roadmap

  • Adaptive concurrency based on latency and endpoint health.
  • Streaming chat helper.
  • Additional gateway-compatible health checks.

Metrics hooks

Both semaphore and embedding workflows accept optional callbacks for lightweight instrumentation:

from ollama_orchestra import EmbeddingService, OllamaSemaphorePool

events = []
pool = OllamaSemaphorePool(metrics_cb=events.append)
service = EmbeddingService("your-embedding-model", ["http://localhost:11434"], metrics_cb=events.append)

Events are dictionaries with an event key, such as semaphore_acquired, embedding_failure, or embedding_endpoint_quarantined.
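Collected events can be summarized with a few lines of standard-library code. The sample events below are made up for illustration; only the event key is assumed, since other fields are not documented here. summarize is a hypothetical helper.

```python
from collections import Counter


def summarize(events):
    """Count collected events by their "event" key."""
    return Counter(item["event"] for item in events)


# Hypothetical sample; in practice the list is filled by metrics_cb=events.append.
sample_events = [
    {"event": "semaphore_acquired"},
    {"event": "embedding_failure"},
    {"event": "semaphore_acquired"},
]
counts = summarize(sample_events)
```

The resulting Counter can be logged periodically or forwarded to a metrics backend.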

Development

uv sync --dev
uv run ruff check .
uv run pytest
uv run python scripts/smoke.py
uv build

License

MIT
