Production helpers for running Ollama under concurrent load across local and remote endpoints.

Maintainers

benjamin-j

ollama-orchestra

Production helpers for running Ollama under concurrent load.

Ollama is excellent for local models, but production pipelines quickly hit coordination problems: one GPU should usually receive one request at a time, multi-GPU ingestion needs endpoint rotation, embedding endpoints need fallback, and reasoning models may burn their token budget before producing visible content.

ollama-orchestra packages those patterns into small async utilities.

Install

uv add ollama-orchestra

Concurrency control

from ollama_orchestra import OllamaSemaphorePool, RoundRobinOllama

pool = OllamaSemaphorePool(local_hosts={"gpu-a.local", "gpu-b.local"})
rr = RoundRobinOllama(["http://gpu-a.local:11434", "http://gpu-b.local:11434"])

url = await rr.next_url()
async with pool.semaphore(url):
    # Call your Ollama client here. Local Ollama endpoints default to 1 slot.
    ...

URLs on port 11434 are treated as local Ollama endpoints by default and get a single slot. Other URLs default to higher concurrency, which suits OpenAI-compatible gateways and cloud APIs.
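The pattern behind these defaults can be sketched with plain asyncio. This is an illustration only, not the library's implementation: EndpointSemaphores, limit_for, run_jobs, and the remote limit of 8 are hypothetical names and values chosen for the example.

```python
import asyncio
import itertools
from urllib.parse import urlparse


class EndpointSemaphores:
    """One asyncio.Semaphore per URL; port 11434 gets a single slot."""

    def __init__(self, local_limit=1, remote_limit=8):
        self._local_limit = local_limit
        self._remote_limit = remote_limit  # assumed value, for illustration only
        self._semaphores = {}

    def limit_for(self, url):
        # Treat the default Ollama port as a local, single-GPU endpoint.
        return self._local_limit if urlparse(url).port == 11434 else self._remote_limit

    def semaphore(self, url):
        # Create the semaphore lazily, the first time a URL is seen.
        if url not in self._semaphores:
            self._semaphores[url] = asyncio.Semaphore(self.limit_for(url))
        return self._semaphores[url]


async def run_jobs(pool, urls, n_jobs):
    """Run n_jobs fake requests round-robin and return peak concurrency per URL."""
    rotation = itertools.cycle(urls)
    active = dict.fromkeys(urls, 0)
    peak = dict.fromkeys(urls, 0)

    async def job(url):
        async with pool.semaphore(url):
            active[url] += 1
            peak[url] = max(peak[url], active[url])
            await asyncio.sleep(0.01)  # stand-in for the real Ollama call
            active[url] -= 1

    await asyncio.gather(*(job(next(rotation)) for _ in range(n_jobs)))
    return peak
```

Running six jobs across two port-11434 URLs should never show more than one request in flight per URL, which is the behaviour a single GPU usually wants.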

Reasoning models gotcha

Some Ollama reasoning models can spend the whole num_predict budget inside hidden reasoning and return an empty visible message with done_reason: "length".

Ollama expects think: false at the top level of the request body, not inside options.
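At the HTTP level, the placement looks roughly like the sketch below. The helper names build_chat_payload and looks_truncated are hypothetical and are not part of ollama-orchestra; the field names (model, messages, stream, think, options, num_predict, done_reason) follow Ollama's chat API.

```python
import json


def build_chat_payload(model, messages, *, think=False, num_predict=256):
    """Build a JSON-serialisable body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "think": think,  # top level of the body, not inside "options"
        "options": {"num_predict": num_predict},  # token budget lives here
    }


def looks_truncated(response):
    """True when the reply hit the token budget without any visible content."""
    content = response.get("message", {}).get("content", "")
    return response.get("done_reason") == "length" and not content.strip()


payload = build_chat_payload(
    "your-model",
    [{"role": "user", "content": "Summarize this log"}],
)
body = json.dumps(payload)  # what would be sent as the POST body
```

A check like looks_truncated is one way to detect the empty-message case described above and decide whether to retry with reasoning disabled or a larger budget.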

from ollama_orchestra import chat

result = await chat(
    "http://localhost:11434",
    "your-model",
    [{"role": "user", "content": "Summarize this log"}],
    think=False,
    num_predict=256,
)

The helper also strips leftover <think>, <reasoning>, <thought>, and simple Markdown fences from returned content by default.
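A minimal sketch of this kind of cleanup, using only the standard library, is shown here. It is not the library's code: strip_reasoning and its regular expressions are assumptions about one reasonable way to do it, and the real helper may handle more cases.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block tidy

# Paired reasoning blocks such as <think>...</think>, including their content.
_BLOCK = re.compile(r"<(think|reasoning|thought)>.*?</\1>", re.DOTALL | re.IGNORECASE)
# Unpaired leftovers such as a lone </think>; only the tag itself is removed.
_STRAY_TAG = re.compile(r"</?(?:think|reasoning|thought)>", re.IGNORECASE)
# A single fence wrapping the whole reply, with an optional language label.
_WRAPPING_FENCE = re.compile(rf"^{FENCE}[\w-]*\n(.*?)\n{FENCE}$", re.DOTALL)


def strip_reasoning(text):
    """Remove reasoning tags and one outer Markdown fence from model output."""
    text = _BLOCK.sub("", text)
    text = _STRAY_TAG.sub("", text).strip()
    match = _WRAPPING_FENCE.match(text)
    return match.group(1).strip() if match else text
```

Only a fence that wraps the entire reply is unwrapped; fences embedded in longer prose are left alone, which is one interpretation of "simple" fences.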

For streaming responses, use stream_chat():

from ollama_orchestra import stream_chat

async for chunk in stream_chat(
    "http://localhost:11434",
    "your-model",
    [{"role": "user", "content": "Summarize this log"}],
    think=False,
):
    print(chunk)

Embeddings with fallback

from ollama_orchestra import EmbeddingService

service = EmbeddingService(
    model="your-embedding-model",
    urls=["http://gpu-a.local:11434", "http://gpu-b.local:11434"],
)

vector = await service.embed_text("Long text is chunked and mean-pooled automatically.")
await service.close()

Features:

- endpoint fallback
- endpoint scoring based on success, failure, and latency
- per-endpoint circuit breakers
- temporary quarantine for failing endpoints
- optional alert callback
- long-text chunking and mean pooling
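The chunking and mean-pooling step in the list above can be illustrated as follows. This is a sketch under stated assumptions: chunk_text splits by character count and mean_pool weights every chunk equally, whereas the library may split by tokens or weight chunks differently. Both function names are hypothetical.

```python
def chunk_text(text, max_chars=1000):
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]


def mean_pool(vectors):
    """Average equal-length vectors component by component."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

In a real pipeline each chunk would be sent to the embedding endpoint and the resulting vectors passed to mean_pool, so a long document still yields one vector of the model's usual dimension.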

Orchestrated Chat

To route chat and reasoning requests across multiple endpoints with concurrency controls and endpoint scoring, use OrchestratedChat:

from ollama_orchestra import OrchestratedChat

service = OrchestratedChat(
    model="your-reasoning-model",
    urls=["http://gpu-a.local:11434", "http://gpu-b.local:11434"],
)

response = await service.chat([{"role": "user", "content": "Explain this alert"}], think=False)

Features:

- endpoint fallback and scoring
- concurrency pool integration (OllamaSemaphorePool)
- circuit breakers and quarantine
- reasoning stripping (<think> blocks are stripped by default)

Use endpoint_status() to inspect the current routing scores and quarantine state:

for endpoint in service.endpoint_status():
    print(endpoint["url"], endpoint["score"], endpoint["quarantined"])
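To make the scoring and quarantine ideas concrete, here is a small self-contained model of an endpoint table. It is a sketch, not the library's algorithm: EndpointState, pick_endpoint, the failure threshold of 3, the 30-second cooldown, and the scoring formula are all assumptions. Only the status keys url, score, and quarantined come from the example above.

```python
from dataclasses import dataclass


@dataclass
class EndpointState:
    url: str
    successes: int = 0
    failures: int = 0
    consecutive_failures: int = 0
    total_latency: float = 0.0
    quarantined_until: float = 0.0

    def record_success(self, latency):
        self.successes += 1
        self.consecutive_failures = 0
        self.total_latency += latency

    def record_failure(self, now, threshold=3, cooldown=30.0):
        self.failures += 1
        self.consecutive_failures += 1
        if self.consecutive_failures >= threshold:
            # Circuit opens: skip this endpoint until the cooldown expires.
            self.quarantined_until = now + cooldown

    def score(self):
        # Higher is better: reward success rate, penalise mean latency.
        total = self.successes + self.failures
        success_rate = self.successes / total if total else 1.0
        mean_latency = self.total_latency / self.successes if self.successes else 0.0
        return success_rate / (1.0 + mean_latency)

    def status(self, now):
        return {
            "url": self.url,
            "score": self.score(),
            "quarantined": now < self.quarantined_until,
        }


def pick_endpoint(states, now):
    """Return the best-scoring endpoint that is not quarantined, or None."""
    available = [s for s in states if now >= s.quarantined_until]
    return max(available, key=EndpointState.score) if available else None
```

Passing the clock in as now keeps the logic deterministic and easy to test; a production version would more likely read time.monotonic() internally.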

Health and prewarm

from ollama_orchestra import check_server_health, prewarm_all_servers

healthy = await check_server_health("http://localhost:11434")
status = await prewarm_all_servers(["http://localhost:11434"], model="your-model")
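For readers who want to see what a health probe involves, the sketch below checks an endpoint with the standard library only. It assumes that a 200 response from Ollama's model-listing route (/api/tags) is a sufficient health signal; the library's own check may probe a different route. The names _probe, is_healthy, and check_all are hypothetical.

```python
import asyncio
import http.client
import urllib.request


def _probe(url, timeout):
    """Blocking probe: True only if /api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{url.rstrip('/')}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException, ValueError):
        # Connection errors, timeouts, HTTP errors, and malformed URLs all
        # count as unhealthy.
        return False


async def is_healthy(url, timeout=2.0):
    # Run the blocking probe in a worker thread so the event loop stays free.
    return await asyncio.to_thread(_probe, url, timeout)


async def check_all(urls, timeout=2.0):
    """Probe every URL concurrently and return {url: healthy}."""
    results = await asyncio.gather(*(is_healthy(u, timeout) for u in urls))
    return dict(zip(urls, results))
```

Treating every failure mode as "unhealthy" rather than raising keeps the caller's routing code simple: it only has to filter on a boolean.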

Documentation and examples

- docs/reasoning-models.md explains Ollama's top-level think: false gotcha.
- docs/production-patterns.md documents concurrency, round-robin, prewarm, and fallback patterns.
- examples/reasoning_chat.py calls Ollama chat with reasoning disabled.
- examples/multi_endpoint_embeddings.py demonstrates embedding fallback across endpoints.
- examples/semaphore_pool.py demonstrates per-endpoint concurrency control.

Roadmap

- Adaptive concurrency based on latency and endpoint health.
- Streaming chat helper.
- Additional gateway-compatible health checks.

Metrics hooks

Both semaphore and embedding workflows accept optional callbacks for lightweight instrumentation:

events = []
pool = OllamaSemaphorePool(metrics_cb=events.append)
service = EmbeddingService("your-embedding-model", ["http://localhost:11434"], metrics_cb=events.append)

Events are dictionaries with an event key, such as semaphore_acquired, embedding_failure, or embedding_endpoint_quarantined.
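Because the callback is just a callable that receives a dictionary, a small aggregator can stand in for a full metrics client. The sketch below is one possible shape: EventCounter is a hypothetical class, and apart from the event key, any other fields in the sample events (such as url) are assumptions made for the demonstration.

```python
from collections import Counter


class EventCounter:
    """Callable that tallies events by their "event" key."""

    def __init__(self):
        self.counts = Counter()
        self.last = {}

    def __call__(self, event):
        name = event.get("event", "unknown")
        self.counts[name] += 1
        self.last[name] = event  # keep the most recent payload per event name


counter = EventCounter()
# Simulated events; in real use the instance would be passed as metrics_cb.
counter({"event": "semaphore_acquired", "url": "http://localhost:11434"})
counter({"event": "semaphore_acquired", "url": "http://localhost:11434"})
counter({"event": "embedding_failure", "url": "http://localhost:11434"})
```

An instance like this could be handed to metrics_cb in place of events.append when only counts are needed rather than the full event log.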

Development

uv sync --dev
uv run ruff check .
uv run pytest
uv run python scripts/smoke.py
uv build

License

MIT

Release history

0.1.8 (this version): Jun 3, 2026
0.1.7: Jun 2, 2026
0.1.6: Jun 1, 2026
0.1.5: Jun 1, 2026
0.1.4: Jun 1, 2026
0.1.3: Jun 1, 2026
0.1.2: Jun 1, 2026
0.1.0: Jun 1, 2026

Download files

Source Distribution

ollama_orchestra-0.1.8.tar.gz (26.9 kB), uploaded Jun 3, 2026, Source

Built Distribution

ollama_orchestra-0.1.8-py3-none-any.whl (15.2 kB), uploaded Jun 3, 2026, Python 3

File details

Details for the file ollama_orchestra-0.1.8.tar.gz.

File metadata

Download URL: ollama_orchestra-0.1.8.tar.gz
Upload date: Jun 3, 2026
Size: 26.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ollama_orchestra-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`00d2a1451c95435afba70277a5a39ea34c570d1f297a8690c3ba4f2fcd4d451b`
MD5	`54cd25c5b19ad4fa158f5e562a7d8270`
BLAKE2b-256	`c283970d4ad6c1e731622a9e2b74677c6746d8c3d0e7166fc3a67300c707bcac`

Provenance

The following attestation bundles were made for ollama_orchestra-0.1.8.tar.gz:

Publisher: publish.yml on BenjaminJornet/ollama-orchestra

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ollama_orchestra-0.1.8.tar.gz
- Subject digest: 00d2a1451c95435afba70277a5a39ea34c570d1f297a8690c3ba4f2fcd4d451b
- Sigstore transparency entry: 1710471122
- Sigstore integration time: Jun 3, 2026
Source repository:
- Permalink: BenjaminJornet/ollama-orchestra@62957b5faf39d830f041cf6b9bf3a33e326e64bf
- Branch / Tag: refs/tags/v0.1.8
- Owner: https://github.com/BenjaminJornet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@62957b5faf39d830f041cf6b9bf3a33e326e64bf
- Trigger Event: release

File details

Details for the file ollama_orchestra-0.1.8-py3-none-any.whl.

File metadata

Download URL: ollama_orchestra-0.1.8-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 15.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ollama_orchestra-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`37e007187e1bae89e97149e1b72d88be85ed9e7d9520ce14655c4a3c85da31fe`
MD5	`1c2da91cd89ca39ebc5ed39e71ada31e`
BLAKE2b-256	`24bad675ac3709a3fcc82780fe9f2bc260b0c0088df71a37799efb5aec093ab9`

Provenance

The following attestation bundles were made for ollama_orchestra-0.1.8-py3-none-any.whl:

Publisher: publish.yml on BenjaminJornet/ollama-orchestra

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ollama_orchestra-0.1.8-py3-none-any.whl
- Subject digest: 37e007187e1bae89e97149e1b72d88be85ed9e7d9520ce14655c4a3c85da31fe
- Sigstore transparency entry: 1710471146
- Sigstore integration time: Jun 3, 2026
Source repository:
- Permalink: BenjaminJornet/ollama-orchestra@62957b5faf39d830f041cf6b9bf3a33e326e64bf
- Branch / Tag: refs/tags/v0.1.8
- Owner: https://github.com/BenjaminJornet
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@62957b5faf39d830f041cf6b9bf3a33e326e64bf
- Trigger Event: release
