
quartermaster-providers

Unified multi-LLM provider abstraction for Python. Write once, run against OpenAI, Anthropic, Google, Groq, xAI, or any OpenAI-compatible endpoint.

PyPI version Python 3.11+ License: Apache 2.0

Features

  • Providers: OpenAI, Anthropic, Google, Groq, xAI, Quartermaster, a generic OpenAI-compatible adapter, plus six local/self-hosted adapters (Ollama, vLLM, LM Studio, TGI, LocalAI, llama.cpp)
  • Streaming: Async generators for token-by-token responses
  • Tool Calling: Unified interface for function/tool invocation across all providers
  • Structured Output: JSON-schema-constrained response generation
  • Extended Thinking: Claude and o-series reasoning chains
  • Vision: Image understanding on supported models
  • Transcription: Audio-to-text via OpenAI Whisper
  • Token Counting: Estimate tokens and cost before making requests
  • Provider Registry: Register providers once, resolve by name or model pattern
  • Testing Utilities: MockProvider and InMemoryHistory for unit tests
  • Type-Safe: Dataclass responses with full type hints

Installation

pip install quartermaster-providers

Install with provider-specific extras:

pip install quartermaster-providers[openai]
pip install quartermaster-providers[anthropic]
pip install quartermaster-providers[openai,anthropic,google]
pip install quartermaster-providers[all]

Supported Providers

Provider        Class                       Models (examples)
OpenAI          OpenAIProvider              gpt-4o, gpt-4-turbo, o1, o3-mini
Anthropic       AnthropicProvider           claude-sonnet-4-20250514, claude-3-haiku
Google          GoogleProvider              gemini-1.5-pro, gemini-pro
Groq            GroqProvider                llama-3-70b, mixtral-8x7b
xAI             XAIProvider                 grok-2, grok-2-mini
Quartermaster   QuartermasterProvider       All models via one API key
Custom          OpenAICompatibleProvider    Any OpenAI-compatible API

Local / Self-Hosted Providers

Provider    Class              Description
Ollama      OllamaProvider     Local models via Ollama
vLLM        VLLMProvider       High-throughput inference server
LM Studio   LMStudioProvider   Desktop LLM app
TGI         TGIProvider        HuggingFace Text Generation Inference
LocalAI     LocalAIProvider    OpenAI-compatible local server
llama.cpp   LlamaCppProvider   llama.cpp HTTP server

Register local providers with one line. The module-level helper builds a registry, normalises base_url (auto-appending /v1 if missing), honours the OLLAMA_HOST env var, and remembers the default model so callers don't have to repeat it:

from quartermaster_providers import register_local

provider_registry = register_local(
    "ollama",
    base_url="http://localhost:11434",   # or set $OLLAMA_HOST
    default_model="gemma4:26b",
)
provider = provider_registry.get("ollama")

Sync OllamaProvider.chat() shim

For one-shot calls from sync code (Celery workers, Django views, CLI scripts), the OllamaProvider exposes a synchronous shim over the native /api/chat endpoint, with no asgiref.async_to_sync wrapper required. Thinking/reasoning text is auto-promoted into content, so reasoning models such as gemma4:26b never return an empty result on short prompts:

from quartermaster_providers.providers.local import OllamaProvider

provider = OllamaProvider(default_model="gemma4:26b")  # honours $OLLAMA_HOST
result = provider.chat(
    messages=[
        {"role": "system", "content": "Respond in Slovenian."},
        {"role": "user", "content": "Pozdravljen!"},
    ],
    max_output_tokens=128,    # honoured — capped at Ollama's `num_predict`
    thinking_level="off",     # off / low / medium / high
)
print(result.content)         # str — promoted from `reasoning` if `content` empty
print(result.tool_calls)      # list[ToolCall]
print(result.usage)           # {prompt_tokens, completion_tokens, total_tokens}

Connection errors raise ServiceUnavailableError; HTTP errors raise ProviderError with status_code attached. Neither is silently swallowed into a soft "no answer" result, as the OpenAI-compatible path used to do.
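
A minimal sketch of guarding the sync shim with these exceptions (model and prompt are illustrative):

from quartermaster_providers.exceptions import ProviderError, ServiceUnavailableError
from quartermaster_providers.providers.local import OllamaProvider

provider = OllamaProvider(default_model="gemma4:26b")
try:
    result = provider.chat(messages=[{"role": "user", "content": "Pozdravljen!"}])
except ServiceUnavailableError:
    print("Ollama is unreachable -- is the server running?")
except ProviderError as exc:
    print(f"Ollama returned HTTP {exc.status_code}")
else:
    print(result.content)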

Quick Start

Text Generation

import asyncio
from quartermaster_providers import LLMConfig
from quartermaster_providers.providers import OpenAIProvider

async def main():
    provider = OpenAIProvider(api_key="sk-...")
    config = LLMConfig(
        model="gpt-4o",
        provider="openai",
        temperature=0.7,
        max_output_tokens=1024,
    )

    response = await provider.generate_text_response(
        prompt="Explain gradient descent in two sentences.",
        config=config,
    )
    print(response.content)  # str
    print(response.stop_reason)  # "end_turn", "max_tokens", etc.

asyncio.run(main())

Tool Calling

import asyncio
from quartermaster_providers import LLMConfig, ToolDefinition
from quartermaster_providers.providers import AnthropicProvider

async def main():
    provider = AnthropicProvider(api_key="sk-ant-...")
    config = LLMConfig(model="claude-sonnet-4-20250514", provider="anthropic")

    tools = [
        ToolDefinition(
            name="get_weather",
            description="Get current weather for a location",
            input_schema={
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        ),
    ]

    response = await provider.generate_tool_parameters(
        prompt="What is the weather in Tokyo?",
        tools=tools,
        config=config,
    )

    for call in response.tool_calls:
        print(f"{call.tool_name}({call.parameters})")
        # get_weather({'location': 'Tokyo'})

    print(f"Usage: {response.usage.total_tokens} tokens")

asyncio.run(main())

Tool Calling with quartermaster-tools

Tools created with @tool() integrate directly via ToolDescriptor:

from quartermaster_tools import tool

@tool()
def get_weather(city: str) -> dict:
    """Get current weather for a city.

    Args:
        city: The city name to look up.
    """
    return {"city": city, "temperature": 22}

# Convert to provider-compatible format
tool_def = get_weather.info().to_anthropic_tools()
# Or for OpenAI:
tool_def = get_weather.info().to_openai_tools()

Streaming

import asyncio
from quartermaster_providers import LLMConfig
from quartermaster_providers.providers import OpenAIProvider

async def main():
    provider = OpenAIProvider(api_key="sk-...")
    config = LLMConfig(model="gpt-4o", provider="openai", stream=True)

    async for chunk in await provider.generate_text_response(
        prompt="Write a haiku about Python.",
        config=config,
    ):
        print(chunk.content, end="", flush=True)

asyncio.run(main())

Structured Output

import asyncio
from quartermaster_providers import LLMConfig
from quartermaster_providers.providers import OpenAIProvider

async def main():
    provider = OpenAIProvider(api_key="sk-...")
    config = LLMConfig(model="gpt-4o", provider="openai")

    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "summary": {"type": "string"},
            "topics": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "summary", "topics"],
    }

    response = await provider.generate_structured_response(
        prompt="Analyze the concept of reinforcement learning.",
        response_schema=schema,
        config=config,
    )

    print(response.structured_output["title"])
    print(response.structured_output["topics"])

asyncio.run(main())

API Reference

LLMConfig

Controls request behavior across all providers.

from quartermaster_providers import LLMConfig

config = LLMConfig(
    model="gpt-4o",             # Provider model identifier
    provider="openai",          # Provider name
    stream=False,               # Stream token-by-token
    temperature=0.7,            # 0.0 (deterministic) to 2.0 (creative)
    system_message=None,        # System prompt
    max_input_tokens=None,      # Input token limit
    max_output_tokens=None,     # Output token limit
    max_messages=None,          # Conversation context limit
    vision=False,               # Enable image understanding
    thinking_enabled=False,     # Extended thinking (Claude, o-series)
    thinking_budget=None,       # Max thinking tokens
    top_p=None,                 # Nucleus sampling
    top_k=None,                 # Top-k sampling
    frequency_penalty=None,     # Frequency penalty (OpenAI)
    presence_penalty=None,      # Presence penalty (OpenAI)
)

AbstractLLMProvider Methods

Every provider implements these methods:

  • await list_models() -> list[str] -- available model identifiers
  • estimate_token_count(text, model) -> int -- token estimate without an API call
  • prepare_tool(tool) -> Any -- convert a ToolDefinition to the provider's format
  • await generate_text_response(prompt, config) -> TokenResponse | AsyncIterator[TokenResponse] -- text generation (streams when config.stream=True)
  • await generate_tool_parameters(prompt, tools, config) -> ToolCallResponse -- function/tool calling
  • await generate_native_response(prompt, tools, config) -> NativeResponse -- text + thinking + tool calls combined
  • await generate_structured_response(prompt, schema, config) -> StructuredResponse -- JSON-schema-constrained output
  • await transcribe(audio_path) -> str -- audio-to-text transcription
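
transcribe is the only method that takes a file path rather than a prompt. A minimal sketch (the audio file name is illustrative):

import asyncio
from quartermaster_providers.providers import OpenAIProvider

async def main():
    provider = OpenAIProvider(api_key="sk-...")
    # Audio-to-text via OpenAI Whisper; returns the transcript as a plain string
    text = await provider.transcribe("standup-recording.wav")
    print(text)

asyncio.run(main())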

Cost-estimation helpers (non-abstract; each returns None if pricing is unavailable):

  • get_cost_per_1k_input_tokens(model) -> float | None
  • get_cost_per_1k_output_tokens(model) -> float | None
  • estimate_cost(text, model, output_tokens) -> float | None

Response Types

TokenResponse -- single response or streaming chunk:

response.content      # str -- text content
response.stop_reason  # str | None -- "end_turn", "max_tokens", "tool_use"

ToolCallResponse -- tool invocation results:

response.text_content  # str -- any text alongside tool calls
response.tool_calls    # list[ToolCall] -- each has .tool_name, .tool_id, .parameters
response.stop_reason   # str | None
response.usage         # TokenUsage | None

StructuredResponse -- JSON-schema-constrained output:

response.structured_output  # dict[str, Any] -- parsed JSON
response.raw_output         # str -- raw model text
response.usage              # TokenUsage | None

NativeResponse -- complete model output:

response.text_content  # str
response.thinking      # list[ThinkingResponse] -- reasoning blocks
response.tool_calls    # list[ToolCall]
response.usage         # TokenUsage | None
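
Extended thinking surfaces through NativeResponse. A minimal sketch of enabling it via LLMConfig; passing an empty tools list is an assumption here, and the thinking blocks are printed as-is since ThinkingResponse's attributes aren't documented above:

import asyncio
from quartermaster_providers import LLMConfig
from quartermaster_providers.providers import AnthropicProvider

async def main():
    provider = AnthropicProvider(api_key="sk-ant-...")
    config = LLMConfig(
        model="claude-sonnet-4-20250514",
        provider="anthropic",
        thinking_enabled=True,  # extended thinking (Claude, o-series)
        thinking_budget=2048,   # cap on thinking tokens
    )
    # generate_native_response returns text, thinking blocks, and tool calls together
    response = await provider.generate_native_response(
        prompt="Plan the steps to migrate a SQLite schema to Postgres.",
        tools=[],  # assumption: no tools needed for a text-plus-thinking call
        config=config,
    )
    for block in response.thinking:  # list[ThinkingResponse]
        print(block)
    print(response.text_content)

asyncio.run(main())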

TokenUsage -- token accounting:

usage.input_tokens                  # int
usage.output_tokens                 # int
usage.cache_creation_input_tokens   # int (Anthropic prompt caching)
usage.cache_read_input_tokens       # int
usage.total_tokens                  # property: input + output
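
A quick sketch of reading usage off any response that carries it (usage can be None, as the response types above note):

def log_usage(response) -> None:
    # Works with ToolCallResponse, StructuredResponse, or NativeResponse
    usage = response.usage
    if usage is None:
        return  # provider didn't report usage
    print(usage.input_tokens, usage.output_tokens)
    print(usage.total_tokens)  # property: input_tokens + output_tokens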

ProviderRegistry

Register providers once and resolve by name or model pattern:

from quartermaster_providers import ProviderRegistry
from quartermaster_providers.providers import OpenAIProvider, AnthropicProvider

registry = ProviderRegistry()
registry.register("openai", OpenAIProvider, api_key="sk-...")
registry.register("anthropic", AnthropicProvider, api_key="sk-ant-...")

# Get by name
provider = registry.get("openai")

# Auto-resolve from model name (gpt-* -> openai, claude-* -> anthropic, etc.)
provider = registry.get_for_model("gpt-4o")
provider = registry.get_for_model("claude-sonnet-4-20250514")

# List registered providers
registry.list_providers()  # ["anthropic", "openai"]

Model-to-provider inference patterns: gpt-*/o1-*/o3-* -> openai, claude-* -> anthropic, gemini-* -> google, llama-*/mixtral-* -> groq, grok-* -> xai.

Token Counting and Cost Estimation

from quartermaster_providers.providers import OpenAIProvider

provider = OpenAIProvider(api_key="sk-...")

tokens = provider.estimate_token_count("Hello, world!", "gpt-4o")
print(f"Estimated tokens: {tokens}")

cost = provider.estimate_cost("Hello, world!", "gpt-4o", output_tokens=100)
if cost is not None:
    print(f"Estimated cost: ${cost:.6f}")

Error Handling

All providers raise consistent exceptions from quartermaster_providers.exceptions:

from quartermaster_providers.exceptions import (
    ProviderError,          # Base exception (has .provider, .status_code)
    AuthenticationError,    # Invalid/missing API key (401)
    RateLimitError,         # Rate limited (429, has .retry_after)
    InvalidModelError,      # Model not available (404, has .model)
    InvalidRequestError,    # Malformed request (400)
    ContentFilterError,     # Blocked by safety filter (400)
    ContextLengthError,     # Input exceeds context window (400)
    ServiceUnavailableError,  # Provider temporarily down (503)
)
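
A common pattern is retrying on rate limits using the attached retry_after hint, falling back to a fixed delay when the provider doesn't send one; a minimal sketch:

import asyncio
from quartermaster_providers.exceptions import RateLimitError

async def generate_with_retry(provider, prompt, config, attempts=3):
    last_exc = None
    for _ in range(attempts):
        try:
            return await provider.generate_text_response(prompt=prompt, config=config)
        except RateLimitError as exc:
            last_exc = exc
            # .retry_after is populated from the 429 response when the provider sends it
            await asyncio.sleep(exc.retry_after or 1.0)
    raise last_exc  # still rate limited after all attempts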

Testing

Use MockProvider for unit tests without real API calls:

from quartermaster_providers import LLMConfig, TokenResponse
from quartermaster_providers.testing import MockProvider

mock = MockProvider(responses=[
    TokenResponse(content="Paris", stop_reason="end_turn"),
    TokenResponse(content="Berlin", stop_reason="end_turn"),
])

config = LLMConfig(model="mock", provider="mock")

# Inside an async test function (e.g. with pytest-asyncio):
response = await mock.generate_text_response("Capital of France?", config)
assert response.content == "Paris"
assert mock.call_count == 1
assert mock.last_prompt == "Capital of France?"

# InMemoryHistory for conversation testing
from quartermaster_providers.testing import InMemoryHistory

history = InMemoryHistory()
history.add_message("user", "Hello")
history.add_message("assistant", "Hi there!")
assert len(history) == 2

Contributing

Contributions welcome. See CONTRIBUTING.md for guidelines.

License

Apache License 2.0. See LICENSE for details.
