Model-agnostic LLM execution library

Maintainers

benballintyn

vox

Model-agnostic LLM execution library for Python. One interface, every provider.

Write your code once and run it against OpenAI, Anthropic, Google Gemini, OpenRouter, or local models via LM Studio — with streaming, tool use, structured output, and reasoning support out of the box.

Installation

# Core library (no provider SDKs)
pip install vox-llm

# With a specific provider
pip install "vox-llm[openai]"
pip install "vox-llm[anthropic]"
pip install "vox-llm[gemini]"

# All providers
pip install "vox-llm[all]"

Note: the PyPI package is vox-llm (the name vox was already taken). The Python import name is still vox — from vox import VoxClient works unchanged.

From GitHub (pinned to a tag):

pip install "vox-llm[all] @ git+https://github.com/benballintyn/vox.git@v0.1.0"

Requires Python 3.11+.

Quick Start

from vox import VoxClient, Message

client = VoxClient(openai_api_key="sk-...")

response = client.complete(
    messages=[Message(role="user", content="What is the speed of light?")],
    model="gpt-4o",
)
print(response.message.text)

Switch providers by changing the model name — no other code changes needed:

# OpenAI
response = client.complete(messages, model="gpt-4o")

# Anthropic
response = client.complete(messages, model="claude-sonnet-4-20250514")

# Gemini
response = client.complete(messages, model="gemini-2.5-pro")

Provider Setup

Pass API keys directly or via environment variables:

client = VoxClient(
    openai_api_key="sk-...",           # or OPENAI_API_KEY env var
    anthropic_api_key="sk-ant-...",    # or ANTHROPIC_API_KEY env var
    gemini_api_key="...",              # or GEMINI_API_KEY env var
    openrouter_api_key="sk-or-...",    # or OPENROUTER_API_KEY env var
    lmstudio_base_url="http://localhost:1234/v1",  # default
)

Provider Auto-Detection

Vox resolves the provider from the model name automatically:

Model prefix	Provider
`gpt-`, `o1`, `o3`, `o4`	OpenAI
`claude-`	Anthropic
`gemini-`	Gemini

For OpenRouter and LM Studio, pass provider= explicitly:

response = client.complete(
    messages=messages,
    model="meta-llama/llama-3-70b",
    provider="openrouter",
)
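The prefix table above can be pictured as a simple lookup. The following is a minimal sketch under stated assumptions, not vox's actual resolver: `resolve_provider` and `_PREFIX_TO_PROVIDER` are hypothetical names, and raising `ValueError` for an unrecognised model is an assumption about the failure mode.

```python
from typing import Optional

# Hypothetical prefix table mirroring the documented mapping above.
_PREFIX_TO_PROVIDER = (
    ("gpt-", "openai"),
    ("o1", "openai"),
    ("o3", "openai"),
    ("o4", "openai"),
    ("claude-", "anthropic"),
    ("gemini-", "gemini"),
)


def resolve_provider(model: str, provider: Optional[str] = None) -> str:
    """Return the provider for a model name; an explicit provider always wins."""
    if provider is not None:
        return provider
    for prefix, name in _PREFIX_TO_PROVIDER:
        if model.startswith(prefix):
            return name
    # Assumption: an unknown prefix is an error rather than a silent default.
    raise ValueError(f"Cannot infer provider for {model!r}; pass provider= explicitly")


print(resolve_provider("claude-sonnet-4-20250514"))
print(resolve_provider("meta-llama/llama-3-70b", provider="openrouter"))
```

This is why OpenRouter and LM Studio model names, which carry no recognisable prefix, need the explicit provider= argument.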

Per-Provider Configuration

Override defaults with ProviderConfig:

from vox import VoxClient, ProviderConfig

client = VoxClient(
    provider_configs={
        "openai": ProviderConfig(
            api_key="sk-...",
            timeout=60.0,
            max_retries=3,
        ),
        "openrouter": ProviderConfig(
            api_key="sk-or-...",
            app_name="MyApp",           # sent as X-Title header
            app_url="https://myapp.com", # sent as HTTP-Referer header
        ),
    }
)

Completions

Basic

from vox import VoxClient, Message

client = VoxClient(openai_api_key="sk-...")

response = client.complete(
    messages=[
        Message(role="system", content="You are a helpful assistant."),
        Message(role="user", content="Explain quantum entanglement."),
    ],
    model="gpt-4o",
    max_tokens=500,
    temperature=0.7,
)

print(response.message.text)
print(f"Tokens: {response.usage.total_tokens}")

Async

response = await client.acomplete(
    messages=[Message(role="user", content="Hello")],
    model="claude-sonnet-4-20250514",
)

Streaming

for chunk in client.stream(
    messages=[Message(role="user", content="Write a haiku about Python.")],
    model="gpt-4o",
):
    if chunk.type == "text":
        print(chunk.text, end="", flush=True)
    elif chunk.type == "usage":
        print(f"\nTokens: {chunk.usage.total_tokens}")
    elif chunk.type == "done":
        print(f"\nFinish reason: {chunk.finish_reason}")

Async Streaming

async for chunk in client.astream(messages=messages, model="gemini-2.5-pro"):
    if chunk.type == "text":
        print(chunk.text, end="")

Stream Chunk Types

`chunk.type`	Fields	Description
`"text"`	`text`	Content delta
`"tool_call_start"`	`tool_call`	New tool call (id, name, arguments)
`"tool_call_delta"`	`tool_call_id`, `arguments_delta`	Partial JSON for tool arguments
`"thinking"`	`thinking_text`	Reasoning/thinking delta
`"usage"`	`usage`	Final token counts
`"done"`	`finish_reason`	Generation complete
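A consumer that needs complete tool calls has to stitch the `tool_call_delta` fragments back together. The sketch below shows one way to do that. `SketchChunk` and `SketchToolCall` are hypothetical stand-ins that only mirror the field names in the table (they are not vox classes), and the assumption that arguments arrive as JSON string fragments may not match the real chunk objects exactly.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional, Tuple


@dataclass
class SketchToolCall:  # hypothetical stand-in for the `tool_call` payload
    id: str
    name: str
    arguments: str = ""  # assumed to be a (possibly empty) JSON string


@dataclass
class SketchChunk:  # hypothetical stand-in for a stream chunk
    type: str
    text: str = ""
    tool_call: Optional[SketchToolCall] = None
    tool_call_id: str = ""
    arguments_delta: str = ""


def collect_stream(chunks: Iterable[SketchChunk]) -> Tuple[str, Dict[str, Dict[str, Any]]]:
    """Join text deltas and rebuild each tool call's JSON arguments."""
    text_parts = []
    names: Dict[str, str] = {}
    raw_args: Dict[str, str] = {}
    for chunk in chunks:
        if chunk.type == "text":
            text_parts.append(chunk.text)
        elif chunk.type == "tool_call_start" and chunk.tool_call is not None:
            names[chunk.tool_call.id] = chunk.tool_call.name
            raw_args[chunk.tool_call.id] = chunk.tool_call.arguments
        elif chunk.type == "tool_call_delta":
            previous = raw_args.get(chunk.tool_call_id, "")
            raw_args[chunk.tool_call_id] = previous + chunk.arguments_delta
    # Parse only once the stream has ended; partial JSON is not valid JSON.
    calls = {
        call_id: {"name": names.get(call_id), "arguments": json.loads(raw) if raw else {}}
        for call_id, raw in raw_args.items()
    }
    return "".join(text_parts), calls


demo = [
    SketchChunk(type="text", text="Checking "),
    SketchChunk(type="text", text="the weather."),
    SketchChunk(type="tool_call_start", tool_call=SketchToolCall(id="call_1", name="get_weather")),
    SketchChunk(type="tool_call_delta", tool_call_id="call_1", arguments_delta='{"city": '),
    SketchChunk(type="tool_call_delta", tool_call_id="call_1", arguments_delta='"Tokyo"}'),
    SketchChunk(type="done"),
]
text, calls = collect_stream(demo)
print(text)
print(calls)
```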

Tool Use (Function Calling)

Define tools, let the model call them, feed results back:

from vox import VoxClient, Message, Tool, ToolResult

client = VoxClient(openai_api_key="sk-...")

# 1. Define tools
tools = [
    Tool(
        name="get_weather",
        description="Get current weather for a city.",
        parameters={
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    ),
]

# 2. Send messages with tools
messages = [Message(role="user", content="What's the weather in Tokyo?")]
response = client.complete(messages=messages, model="gpt-4o", tools=tools)

# 3. Handle tool calls
if response.message.tool_calls:
    messages.append(response.message)  # add assistant's tool call message

    for tc in response.message.tool_calls:
        # Execute the function (your code)
        result = get_weather(tc.arguments["city"])

        # Return result to the model
        tool_result = ToolResult(
            tool_call_id=tc.id,
            name=tc.name,
            content=result,
        )
        messages.append(tool_result.to_message())

    # 4. Get final response
    final = client.complete(messages=messages, model="gpt-4o", tools=tools)
    print(final.message.text)

This works identically across OpenAI, Anthropic, Gemini, and OpenRouter — vox translates the tool definitions and results to each provider's native format.
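With more than one tool, step 3 is usually easier to manage with a name-to-function registry. The following is a minimal sketch of that dispatch step: `TOOL_REGISTRY`, `run_tool`, and the placeholder `get_weather` body are hypothetical and not part of vox. The returned flag is meant to feed the `is_error` field of `ToolResult`.

```python
from typing import Any, Callable, Dict, Tuple


def get_weather(city: str) -> str:
    """Placeholder implementation; a real one would call a weather service."""
    return f"Sunny in {city}"


# Hypothetical registry keyed by the tool name the model sees.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool(name: str, arguments: Dict[str, Any]) -> Tuple[str, bool]:
    """Execute a tool by name and return (content, is_error)."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return f"Unknown tool: {name}", True
    try:
        return str(func(**arguments)), False
    except Exception as exc:  # report failures to the model instead of crashing
        return f"{type(exc).__name__}: {exc}", True


content, is_error = run_tool("get_weather", {"city": "Tokyo"})
print(content, is_error)
```

Inside the loop, the two values would be passed as content= and is_error= when building each ToolResult.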

Provider-native (server-side) tools

Some providers offer server-side tools that run on their infrastructure — Anthropic's web_search_20250305, OpenAI's web_search_preview, Gemini's Google Search grounding, and others. These have provider-specific shapes and no cross-provider abstraction, so vox does not model them as a Tool. Instead, the tools list accepts raw dicts alongside vox Tool objects — raw dicts are passed through to the provider verbatim:

response = client.complete(
    messages=[Message(role="user", content="What's the current 10Y JGB yield?")],
    model="claude-sonnet-4-5-20250929",
    tools=[
        my_function_tool,  # vox Tool — translated to the provider's format
        {                  # raw dict — passed through verbatim
            "type": "web_search_20250305",
            "name": "web_search",
            "max_uses": 5,
        },
    ],
)

The caller is responsible for matching the resolved provider's expected schema — a raw dict shaped for one provider won't work on another. An entry that is neither a Tool nor a dict raises a TypeError.

Structured Output

Pass a Pydantic model to get validated, typed responses:

from pydantic import BaseModel
from vox import VoxClient, Message

class MovieReview(BaseModel):
    title: str
    rating: float
    summary: str
    pros: list[str]
    cons: list[str]

client = VoxClient(openai_api_key="sk-...")

response = client.complete(
    messages=[Message(role="user", content="Review the movie Inception.")],
    model="gpt-4o",
    response_schema=MovieReview,
)

review: MovieReview = response.parsed
print(f"{review.title}: {review.rating}/10")
print(f"Pros: {', '.join(review.pros)}")

The schema is automatically converted to each provider's native format:

OpenAI: JSON schema in response_format
Anthropic: Synthetic tool with forced invocation
Gemini: response_schema parameter
OpenRouter/LM Studio: JSON schema in response_format
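To make the difference between the first two strategies concrete, the sketch below builds both request fragments from the same JSON Schema. The payload shapes follow the public OpenAI Chat Completions and Anthropic Messages APIs, but the helper names and the hand-written schema are hypothetical, and how vox assembles these internally is an assumption.

```python
from typing import Any, Dict

# Hand-written JSON Schema standing in for what a Pydantic model would produce.
MOVIE_REVIEW_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rating": {"type": "number"},
        "summary": {"type": "string"},
    },
    "required": ["title", "rating", "summary"],
}


def openai_style_payload(name: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """JSON schema carried in response_format."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema},
        }
    }


def anthropic_style_payload(name: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Synthetic tool whose input schema is the target schema, with forced invocation."""
    return {
        "tools": [
            {
                "name": name,
                "description": f"Return a {name} object.",
                "input_schema": schema,
            }
        ],
        # Forcing this tool means the model must answer with schema-shaped input.
        "tool_choice": {"type": "tool", "name": name},
    }


print(openai_style_payload("MovieReview", MOVIE_REVIEW_SCHEMA))
print(anthropic_style_payload("MovieReview", MOVIE_REVIEW_SCHEMA))
```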

Reasoning / Thinking

Enable extended reasoning for models that support it:

from vox import VoxClient, Message, ReasoningConfig

client = VoxClient(anthropic_api_key="sk-ant-...")

response = client.complete(
    messages=[Message(role="user", content="Prove that sqrt(2) is irrational.")],
    model="claude-sonnet-4-20250514",
    reasoning=ReasoningConfig(enabled=True, budget_tokens=10000),
)

# Access thinking blocks
if response.thinking:
    for block in response.thinking:
        print(f"[Thinking] {block.text[:200]}...")

print(response.message.text)

Configuration by Provider

Provider	Config	Description
Anthropic	`budget_tokens`	Token budget for extended thinking
OpenAI (o-series)	`level` ("low"/"medium"/"high")	Reasoning effort level
Gemini 2.5	`budget_tokens`	Thinking token budget
Gemini 3+	`level` ("low"/"medium"/"high")	Thinking level

Multimodal (Vision)

Send images alongside text:

from vox import Message, TextContent, ImageContent

message = Message(
    role="user",
    content=[
        TextContent(text="What's in this image?"),
        ImageContent(
            source_type="url",
            media_type="image/jpeg",
            data="https://example.com/photo.jpg",
        ),
    ],
)

response = client.complete(messages=[message], model="gpt-4o")

For base64 images:

import base64

with open("photo.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

message = Message(
    role="user",
    content=[
        TextContent(text="Describe this image."),
        ImageContent(source_type="base64", media_type="image/png", data=b64),
    ],
)

Video input

vox accepts video via a VideoContent part that mirrors ImageContent's shape. Provider routing:

Gemini consumes video natively (inline base64 or hosted URI, including YouTube links — video/mp4, video/webm, etc.).
OpenAI, Anthropic, OpenRouter, and LM Studio have no native video input today. vox falls back to client-side frame extraction: it decodes the video, samples a handful of frames at ~1 fps (capped at 8), and substitutes them as ImageContent parts before dispatch. A loud warning is emitted via loguru so the cost implication is visible. Install the extra to enable this: pip install 'vox-llm[video]'. Consumers that want explicit control over sampling should pass ImageContent parts directly.

from pathlib import Path
from vox import Message, TextContent, VideoContent

video = VideoContent(
    source_type="base64",
    media_type="video/mp4",
    data=Path("clip.mp4").read_bytes(),  # raw bytes auto-base64 encoded
)

response = client.complete(
    messages=[
        Message(
            role="user",
            content=[
                TextContent(text="Summarize what happens in this clip."),
                video,
            ],
        )
    ],
    model="gemini-2.5-pro",  # native; or gpt-5-mini for frame-fallback
)

Hosted-URI form (Gemini only — YouTube link or Files-API URI):

VideoContent(
    source_type="url",
    media_type="video/mp4",
    data="https://www.youtube.com/watch?v=...",
)
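For the frame-extraction fallback used by non-Gemini providers, the sampling rule (about 1 fps, at most 8 frames) can be sketched as a timestamp picker. This is an illustration only: `sample_timestamps` is a hypothetical helper, and spreading the capped frames evenly across the clip is an assumption, since the text above does not say which frames are kept when the cap applies.

```python
from typing import List


def sample_timestamps(duration_s: float, fps: float = 1.0, max_frames: int = 8) -> List[float]:
    """Pick frame timestamps (seconds) at roughly `fps`, capped at `max_frames`."""
    if duration_s <= 0:
        return []
    wanted = max(1, int(duration_s * fps))  # about one frame per second
    count = min(wanted, max_frames)         # never more than the cap
    step = duration_s / count
    # Take the midpoint of each equal-length slice so samples span the whole clip.
    return [round(step * (i + 0.5), 3) for i in range(count)]


print(sample_timestamps(3.0))
print(len(sample_timestamps(60.0)))
```

A one-minute clip therefore costs as many image inputs as an eight-second one, which is the cost implication the warning refers to.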

Audio I/O (transcribe + synthesize)

Audio doesn't fit naturally into the general complete() flow — the flagship reasoning models (Claude Opus / Sonnet, GPT-5, Gemini 3) don't accept audio natively. vox exposes audio through dedicated methods that hit each provider's actual STT / TTS surface:

Provider	`transcribe()` (STT)	`synthesize()` (TTS)
OpenAI	`whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`	`tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`
Gemini	`gemini-3.5-flash`+ (via `generate_content` with audio Part)	`gemini-3.1-flash-tts-preview` (PCM wrapped as WAV)
Anthropic / OpenRouter / LM Studio	raises `InvalidRequestError`	raises `InvalidRequestError`
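The "PCM wrapped as WAV" note in the Gemini row refers to adding a WAV container around raw samples. A minimal sketch using only the standard library follows. `pcm_to_wav` is a hypothetical helper, and the defaults (24 kHz, mono, 16-bit) are assumptions about the PCM format rather than values stated in this document.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV (RIFF) container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


silence = b"\x00\x00" * 100  # 100 frames of 16-bit mono silence
wav_bytes = pcm_to_wav(silence)
print(wav_bytes[:4], len(wav_bytes))
```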

Transcribe

from pathlib import Path
from vox import AudioContent, VoxClient

client = VoxClient()

result = client.transcribe(
    AudioContent(
        source_type="base64",
        media_type="audio/wav",
        data=Path("meeting.wav").read_bytes(),  # bytes auto-base64 encoded
    ),
    model="whisper-1",
    language="en",            # ISO-639-1; OpenAI only, Gemini ignores
    prompt="meeting notes",   # optional bias prompt (Whisper)
)

print(result.text)
print(result.language, result.duration)  # populated when provider reports

Synthesize

import base64
from pathlib import Path

audio = client.synthesize(
    text="The quick brown fox jumps over the lazy dog.",
    voice="alloy",            # provider-specific values
    model="tts-1",            # or gpt-4o-mini-tts, gemini-3.1-flash-tts-preview
    format="mp3",             # OpenAI: mp3/opus/aac/flac/wav/pcm; Gemini: always wav
    speed=1.0,                # OpenAI only
)

Path("out.mp3").write_bytes(base64.standard_b64decode(audio.data))

Available voices:

OpenAI (vox.providers.openai.OPENAI_TTS_VOICES): alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar (marin / cedar are highest quality).
Gemini (vox.providers.gemini.GEMINI_TTS_VOICES): Aoede, Charon, Fenrir, Kore, Leda, Orus, Puck, Zephyr.

Async variants (atranscribe, asynthesize) mirror the sync API.

Retries

vox retries transient provider errors automatically. The default policy is 3 retries with exponential backoff and jitter, honouring any retry_after value the provider returns on a RateLimitError.

from vox import RetryPolicy, VoxClient

client = VoxClient(
    retry_policy=RetryPolicy(
        max_retries=5,          # up to 5 retries after the initial call
        base_delay=1.0,         # first retry waits ~1s, then ~2s, ~4s, ...
        max_delay=30.0,         # cap any single sleep
        exponential_factor=2.0,
        jitter=0.25,            # ±25% randomization to avoid thundering herd
    )
)

Per-call override on any method:

client.complete(
    messages,
    model="gpt-5",
    retry_policy=RetryPolicy(max_retries=0),  # disable retries for this call
)

What gets retried. Only RateLimitError and ProviderError by default — these are the transient-by-nature ones. InvalidRequestError, AuthenticationError, ContentFilterError, ModelNotFoundError, and non-vox exceptions propagate immediately. Customize the whitelist via RetryPolicy(retry_on=(...)).

Streaming. Retries only fire before the first chunk is yielded. Once data has started arriving, errors propagate as-is — replaying a partial stream would surprise the consumer.

retry_after precedence. When a RateLimitError carries a server-supplied retry_after, vox uses that value (capped by max_delay) instead of the computed backoff.
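The interaction between backoff, jitter, the cap, and retry_after can be sketched as a single delay function. This is a sketch under stated assumptions, not the library's code: `compute_delay` is a hypothetical name, its parameters borrow the RetryPolicy field names shown above, and applying jitter before the max_delay cap is an assumption about ordering.

```python
import random
from typing import Callable, Optional


def compute_delay(
    attempt: int,
    *,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    exponential_factor: float = 2.0,
    jitter: float = 0.25,
    retry_after: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to sleep before retry number `attempt` (0 is the first retry)."""
    if retry_after is not None:
        # A server-supplied hint replaces the computed backoff, but is still capped.
        return min(retry_after, max_delay)
    delay = base_delay * exponential_factor ** attempt
    # Scale by a random factor in [1 - jitter, 1 + jitter].
    delay *= 1.0 + jitter * (2.0 * rng() - 1.0)
    return min(delay, max_delay)


for attempt in range(6):
    print(attempt, compute_delay(attempt, jitter=0.0))
```

With jitter disabled, the loop shows the doubling schedule flattening out once it reaches max_delay.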

Callbacks (Observability Hooks)

Wire telemetry — OpenTelemetry, Langfuse, Helicone, custom logging, whatever — without monkey-patching, via the CallbackHandler protocol. Pass any number of handlers to VoxClient(callbacks=[...]) and vox fires them around every call.

from vox import CallbackHandler, LoggingHandler, VoxClient

client = VoxClient(
    callbacks=[LoggingHandler()],   # built-in: logs every call via loguru
    capture_content=False,          # default: no PII in event payloads
)

Three events per call lifecycle:

Event	When	Payload
`on_request(RequestEvent)`	Before the provider call	`model`, `provider`, `method`, `request_kwargs`
`on_response(ResponseEvent)`	After a successful response	`model`, `provider`, `method`, `duration_ms`, `usage`, `response`
`on_error(ErrorEvent)`	After a failed call (post-retry)	`model`, `provider`, `method`, `duration_ms`, `error`

Custom handlers implement any subset of the methods:

class CostBudgetTracker:
    def __init__(self) -> None:
        self.spend_usd = 0.0

    def on_response(self, event):
        if event.usage and event.usage.estimated_cost:
            self.spend_usd += event.usage.estimated_cost

tracker = CostBudgetTracker()
client = VoxClient(callbacks=[tracker])

OpenTelemetry without depending on `opentelemetry-api`

Each event ships a to_otel_attributes() helper that returns a dict keyed by the OpenTelemetry GenAI semantic conventions (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, etc.). Consumers wiring vox into OTel can set the standard attribute names on a span in one line, while vox itself has no dependency on opentelemetry-api.

from opentelemetry import trace

class OTelHandler:
    def on_request(self, event):
        span = trace.get_current_span()
        span.set_attributes(event.to_otel_attributes())

    def on_response(self, event):
        span = trace.get_current_span()
        span.set_attributes(event.to_otel_attributes())

Behaviour

No PII by default. request_kwargs strips messages / audio / text / prompt; response is set to None. Pass VoxClient(capture_content=True) to include the full payloads when every handler in the list is trusted with sensitive data.
Handler exceptions are swallowed at WARNING level via loguru. A buggy telemetry handler never breaks the real LLM call.
Async paths use a thread executor. From acomplete / astream / atranscribe / asynthesize, vox dispatches each handler call via loop.run_in_executor and returns immediately — a slow handler doing blocking I/O won't stall the response.
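The first two behaviours above (payload redaction and swallowed handler exceptions) can be sketched as follows. The names `redact` and `fire` are hypothetical, the standard logging module stands in for loguru, and the sketch makes no claim about how vox implements its dispatch.

```python
import logging
from typing import Any, Dict, Iterable

logger = logging.getLogger("callbacks_sketch")

# Keys the text above says are stripped from request_kwargs by default.
_SENSITIVE_KEYS = frozenset({"messages", "audio", "text", "prompt"})


def redact(request_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Drop content-bearing keys so event payloads carry no user data."""
    return {k: v for k, v in request_kwargs.items() if k not in _SENSITIVE_KEYS}


def fire(handlers: Iterable[Any], method_name: str, event: Any) -> int:
    """Call `method_name` on every handler that defines it; return how many succeeded."""
    delivered = 0
    for handler in handlers:
        hook = getattr(handler, method_name, None)
        if hook is None:  # handlers may implement any subset of the hooks
            continue
        try:
            hook(event)
            delivered += 1
        except Exception:
            # A broken telemetry handler must not break the real call.
            logger.warning("handler %r failed in %s", handler, method_name, exc_info=True)
    return delivered


class CountingHandler:
    def __init__(self) -> None:
        self.seen = 0

    def on_response(self, event: Any) -> None:
        self.seen += 1


class BrokenHandler:
    def on_response(self, event: Any) -> None:
        raise RuntimeError("telemetry backend down")


counter = CountingHandler()
print(fire([BrokenHandler(), counter], "on_response", {"model": "gpt-4o"}))
print(redact({"model": "gpt-4o", "messages": ["hi"], "temperature": 0.7}))
```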

Error Handling

All provider errors are normalized to a consistent hierarchy:

from vox.errors import (
    VoxError,              # base class
    AuthenticationError,   # invalid/missing API key
    RateLimitError,        # rate limited (has .retry_after)
    QuotaExceededError,    # billing/quota limit
    InvalidRequestError,   # malformed request
    ProviderError,         # server error (5xx)
    ContentFilterError,    # safety system blocked content
    ModelNotFoundError,    # model doesn't exist
)

try:
    response = client.complete(messages=messages, model="gpt-4o")
except RateLimitError as e:
    print(f"Rate limited by {e.provider}, retry after {e.retry_after}s")
except AuthenticationError as e:
    print(f"Auth failed for {e.provider}: {e}")
except VoxError as e:
    print(f"LLM error: {e}")
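Normalization of this kind typically maps each provider's failure signal onto one class in the hierarchy. The sketch below illustrates the idea with HTTP status codes. Every class and the `normalize` function are hypothetical stand-ins (prefixed Sketch to avoid confusion with vox.errors), and the status-to-class mapping is an assumption rather than the library's documented behaviour.

```python
from typing import Optional


class SketchVoxError(Exception):
    """Stand-in base class carrying the provider name."""

    def __init__(self, message: str, provider: str) -> None:
        super().__init__(message)
        self.provider = provider


class SketchAuthenticationError(SketchVoxError):
    pass


class SketchInvalidRequestError(SketchVoxError):
    pass


class SketchModelNotFoundError(SketchVoxError):
    pass


class SketchProviderError(SketchVoxError):
    pass


class SketchRateLimitError(SketchVoxError):
    def __init__(self, message: str, provider: str, retry_after: Optional[float] = None) -> None:
        super().__init__(message, provider)
        self.retry_after = retry_after


def normalize(provider: str, status: int, message: str,
              retry_after: Optional[float] = None) -> SketchVoxError:
    """Map an HTTP status to an error class (assumed mapping, for illustration)."""
    if status in (401, 403):
        return SketchAuthenticationError(message, provider)
    if status == 404:
        return SketchModelNotFoundError(message, provider)
    if status == 429:
        return SketchRateLimitError(message, provider, retry_after)
    if status in (400, 422):
        return SketchInvalidRequestError(message, provider)
    if status >= 500:
        return SketchProviderError(message, provider)
    return SketchVoxError(message, provider)


err = normalize("openai", 429, "slow down", retry_after=2.0)
print(type(err).__name__, err.provider, err.retry_after)
```

Because every class derives from one base, a final except clause on the base class catches anything the more specific clauses did not.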

API Reference

VoxClient

VoxClient(
    openai_api_key: str | None = None,
    anthropic_api_key: str | None = None,
    gemini_api_key: str | None = None,
    openrouter_api_key: str | None = None,
    lmstudio_base_url: str = "http://localhost:1234/v1",
    openrouter_app_name: str | None = None,
    openrouter_app_url: str | None = None,
    provider_configs: dict[str, ProviderConfig] | None = None,
)

Methods

Method	Signature	Returns
`complete()`	`(messages, model, *, provider, max_tokens, temperature, tools, response_schema, reasoning, stop, **kwargs)`	`CompletionResponse`
`acomplete()`	Same as above	`CompletionResponse` (async)
`stream()`	Same as above	`Iterator[StreamChunk]`
`astream()`	Same as above	`AsyncIterator[StreamChunk]`

CompletionResponse

Field	Type	Description
`message`	`Message`	Assistant's response message
`usage`	`Usage`	Token counts
`provider`	`str`	Provider name
`model`	`str`	Model used
`finish_reason`	`str | None`	Why generation stopped
`thinking`	`list[ThinkingBlock] | None`	Reasoning blocks
`parsed`	`Any`	Validated Pydantic instance (when `response_schema` used)

Message

Field	Type	Description
`role`	`"system" | "user" | "assistant" | "tool"`	Message role
`content`	`str | list[ContentPart]`	Text or multimodal content
`tool_calls`	`list[ToolCallData] | None`	Tool calls (assistant messages)
`tool_call_id`	`str | None`	Tool result reference
`name`	`str | None`	Tool name (for tool messages)

Property: .text — extracts plain text from any content format.

Tool

Tool(
    name: str,              # Function name
    description: str,       # What the function does
    parameters: dict,       # JSON Schema for arguments
)

ToolResult

ToolResult(
    tool_call_id: str,      # ID from ToolCallData
    name: str,              # Tool name
    content: str,           # Result content
    is_error: bool = False, # Whether execution failed
)

Method: .to_message() — converts to a Message with role="tool".

Usage

Field	Type	Description
`prompt_tokens`	`int`	Input tokens
`completion_tokens`	`int`	Output tokens
`total_tokens`	`int`	Total tokens
`reasoning_tokens`	`int`	Reasoning/thinking tokens
`cache_read_tokens`	`int`	Prompt cache hits
`cache_creation_tokens`	`int`	Prompt cache writes

ProviderConfig

ProviderConfig(
    api_key: str | None = None,
    base_url: str | None = None,
    default_model: str | None = None,
    app_name: str | None = None,     # OpenRouter: X-Title header
    app_url: str | None = None,      # OpenRouter: HTTP-Referer header
    timeout: float = 120.0,
    max_retries: int = 2,
)

ReasoningConfig

ReasoningConfig(
    enabled: bool = True,
    budget_tokens: int | None = None,   # Anthropic, Gemini 2.5
    level: str | None = None,           # "low" | "medium" | "high" — OpenAI o-series, Gemini 3+
)

LM Studio (Local Models)

Run models locally with LM Studio:

client = VoxClient(lmstudio_base_url="http://localhost:1234/v1")

response = client.complete(
    messages=[Message(role="user", content="Hello!")],
    model="local-model",
    provider="lmstudio",
)

Make sure LM Studio is running with a model loaded. The default base URL is http://localhost:1234/v1.

License

MIT

Release history

0.6.0

May 28, 2026

0.5.0

May 27, 2026

0.4.0

May 27, 2026

0.3.1

May 27, 2026

0.3.0

May 26, 2026

0.2.0

May 26, 2026

0.1.2

May 26, 2026

0.1.1

May 22, 2026

0.1.0

May 15, 2026

Download files

Source Distribution

vox_llm-0.6.0.tar.gz (73.4 kB)

Uploaded May 28, 2026 Source

Built Distribution

vox_llm-0.6.0-py3-none-any.whl (79.6 kB)

Uploaded May 28, 2026 Python 3

File details

Details for the file vox_llm-0.6.0.tar.gz.

File metadata

Download URL: vox_llm-0.6.0.tar.gz
Upload date: May 28, 2026
Size: 73.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vox_llm-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`4090188b49077dd18b0eb1fe5c0aa491c2ef3c2b3a6e3fd114507cd39f48850a`
MD5	`70dd45418854663135dbd2d1505925eb`
BLAKE2b-256	`aad804fc6bcbf41c5aeb353bd300879f8aabfbea268b928063b0440207a66516`

File details

Details for the file vox_llm-0.6.0-py3-none-any.whl.

File metadata

Download URL: vox_llm-0.6.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 79.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vox_llm-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a16d69dad9018b50ce17d740e728f1eb375c42e1273abcdecc902e76a46c0f8d`
MD5	`37982ee42db3eca2d299f0a8610ae19a`
BLAKE2b-256	`5b2eaa118c286d9e256c7d548b35d3b245271635ed76e3b7f8f52a9290137045`
