A Python SDK for Inference Gateway
Project description
🚀 Inference Gateway Python SDK
A modern and easy-to-use Python SDK for the Inference Gateway
Connect to multiple LLM providers through a unified interface • Stream responses • Function calling • Vision support • MCP tools support • Pydantic validation
- 🚀 Inference Gateway Python SDK
Installation
To install the SDK, use pip:
pip install inference-gateway
Requires Python 3.12+.
Usage
Creating a Client
To create a client, instantiate InferenceGatewayClient:
from inference_gateway import InferenceGatewayClient, Message
client = InferenceGatewayClient("http://localhost:8080/v1")
The client also supports authentication, custom timeouts, and an optional httpx backend:
# With authentication
client = InferenceGatewayClient(
"http://localhost:8080/v1",
token="your-api-token",
timeout=60.0,
)
# Using httpx instead of the default requests backend
client = InferenceGatewayClient(
"http://localhost:8080/v1",
use_httpx=True,
)
# Use as a context manager to ensure the underlying HTTP client is closed
with InferenceGatewayClient("http://localhost:8080/v1") as client:
models = client.list_models()
Listing Models
To list available models, use the list_models method:
# List all models from all providers
models = client.list_models()
print("All available models:", models)
# List models for a specific provider
openai_models = client.list_models(provider="openai")
print("OpenAI models:", openai_models)
Listing MCP Tools
To list available MCP (Model Context Protocol) tools, use the list_tools method. This functionality is only available when MCP_ENABLE and MCP_EXPOSE are set on the Inference Gateway server:
tools = client.list_tools()
print(f"Found {len(tools.data)} MCP tools:")
for tool in tools.data:
print(f"- {tool.name}: {tool.description} (Server: {tool.server})")
Note: The MCP tools endpoint requires authentication and is only accessible when the server has
MCP_EXPOSE=trueconfigured.
Server-Side Tool Management
The SDK currently supports listing available MCP tools, which is particularly useful for UI applications that need to display connected tools to users. The key advantage is that tools are managed server-side:
- Automatic Tool Injection: Tools are automatically inferred and injected into requests by the Inference Gateway server
- Simplified Client Code: No need to manually manage or configure tools in your client application
- Transparent Tool Calls: During streaming chat completions with configured MCP servers, tool calls appear in the response stream - no special handling required except optionally displaying them to users
Generating Content
To generate content using a model, use the create_chat_completion method:
Note: Some models support reasoning capabilities. You can use the
reasoning_formatparameter to control how reasoning is provided in the response. The model's reasoning will be available in thereasoningorreasoning_contentfields of the response message.
from inference_gateway import InferenceGatewayClient, Message
client = InferenceGatewayClient("http://localhost:8080/v1")
response = client.create_chat_completion(
model="ollama/llama2",
messages=[
Message(role="system", content="You are a helpful assistant."),
Message(role="user", content="What is Python?"),
],
)
print(response.choices[0].message.content.root)
# If reasoning was requested and the model supports it
if response.choices[0].message.reasoning:
print("Reasoning:", response.choices[0].message.reasoning)
Vision Support
The SDK supports multimodal messages with images for vision-capable models like GPT-4o. You can include images via URLs or base64-encoded data URLs.
Simple Text Message
from inference_gateway import InferenceGatewayClient, Message
client = InferenceGatewayClient("http://localhost:8080/v1")
response = client.create_chat_completion(
model="openai/gpt-4o",
messages=[Message(role="user", content="What is the Python programming language?")],
)
Vision Message with Image URL
from inference_gateway import (
InferenceGatewayClient,
Message,
TextContentPart,
ImageContentPart,
ImageURL,
)
client = InferenceGatewayClient("http://localhost:8080/v1")
response = client.create_chat_completion(
model="openai/gpt-4o",
messages=[
Message(
role="user",
content=[
TextContentPart(type="text", text="What is in this image?"),
ImageContentPart(
type="image_url",
image_url=ImageURL(
url="https://example.com/image.jpg",
detail="auto",
),
),
],
)
],
)
Vision Message with Base64 Encoded Image
from inference_gateway import ImageContentPart, ImageURL
ImageContentPart(
type="image_url",
image_url=ImageURL(
url="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD...",
detail="high", # better quality, more expensive
),
)
Multiple Images in One Message
Message(
role="user",
content=[
TextContentPart(type="text", text="Compare these images:"),
ImageContentPart(type="image_url", image_url=ImageURL(url="https://example.com/image1.jpg")),
ImageContentPart(type="image_url", image_url=ImageURL(url="https://example.com/image2.jpg")),
],
)
Image Detail Levels:
"auto": Automatic detail level (default)"low": Lower resolution, faster and cheaper"high": Higher resolution, better quality but more expensive
For a complete example, see the chat example.
Using ReasoningFormat
You can enable reasoning capabilities by setting the reasoning_format parameter in your request:
from inference_gateway import InferenceGatewayClient, Message
client = InferenceGatewayClient("http://localhost:8080/v1")
response = client.create_chat_completion(
model="anthropic/claude-3-opus-20240229",
messages=[
Message(role="system", content="You are a helpful assistant. Please include your reasoning for complex questions."),
Message(role="user", content="What is the square root of 144 and why?"),
],
reasoning_format="parsed", # "raw" or "parsed" - defaults to "parsed"
)
print("Content:", response.choices[0].message.content.root)
if response.choices[0].message.reasoning:
print("Reasoning:", response.choices[0].message.reasoning)
Streaming Content
To generate content using streaming mode, use the create_chat_completion_stream method. It yields SSEvent objects:
import json
from pydantic import ValidationError
from inference_gateway import InferenceGatewayClient, Message
from inference_gateway.models import CreateChatCompletionStreamResponse
client = InferenceGatewayClient("http://localhost:8080/v1")
for chunk in client.create_chat_completion_stream(
model="ollama/llama2",
messages=[
Message(role="system", content="You are a helpful assistant."),
Message(role="user", content="Tell me a story."),
],
):
if not chunk.data:
continue
try:
data = json.loads(chunk.data)
stream_response = CreateChatCompletionStreamResponse.model_validate(data)
except (json.JSONDecodeError, ValidationError):
continue
for choice in stream_response.choices:
# Reasoning content (both reasoning and reasoning_content fields)
if choice.delta.reasoning:
print(f"💭 Reasoning: {choice.delta.reasoning}")
if choice.delta.reasoning_content:
print(f"💭 Reasoning: {choice.delta.reasoning_content}")
if choice.delta.content:
print(choice.delta.content, end="", flush=True)
Tool-Use
To use tools with the SDK, define a tool with the type-safe Pydantic models and pass it to the request:
from inference_gateway import InferenceGatewayClient, Message
from inference_gateway.models import ChatCompletionTool, FunctionObject, FunctionParameters
client = InferenceGatewayClient("http://localhost:8080/v1")
tools = [
ChatCompletionTool(
type="function",
function=FunctionObject(
name="get_current_weather",
description="Get the current weather in a given location",
parameters=FunctionParameters(
type="object",
properties={
"location": {
"type": "string",
"enum": ["san francisco", "new york", "london", "tokyo", "sydney"],
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use",
},
},
required=["location"],
),
),
),
ChatCompletionTool(
type="function",
function=FunctionObject(
name="get_current_time",
description="Get the current time in a given location",
parameters=FunctionParameters(
type="object",
properties={
"location": {
"type": "string",
"enum": ["san francisco", "new york", "london", "tokyo", "sydney"],
"description": "The city and state, e.g. San Francisco, CA",
},
},
required=["location"],
),
),
),
]
response = client.create_chat_completion(
model="openai/gpt-4o",
messages=[
Message(role="system", content="You are a helpful assistant with access to weather and time information."),
Message(role="user", content="What is the weather like in New York?"),
],
tools=tools,
)
# Inspect any tool calls made by the model
if response.choices[0].message.tool_calls:
for tool_call in response.choices[0].message.tool_calls:
print(f"Tool called: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
Provider-Specific Tool-Call Metadata
Some providers attach opaque, per-call metadata that must be echoed back on follow-up requests. The most notable case is Google Gemini's reasoning models, which return a thought_signature on each tool call - the next request must round-trip it verbatim or the provider will reject it.
The SDK preserves this automatically as long as you append the assistant message back to the conversation as a model object (rather than reconstructing it from a dict):
response = client.create_chat_completion(
model="google/gemini-3-pro",
messages=messages,
tools=tools,
)
assistant_message = response.choices[0].message
messages.append(assistant_message) # preserves extra_content.google.thought_signature
# ... append your tool results, then send the follow-up request ...
If you need to construct one explicitly:
from inference_gateway import Google, ToolCallExtraContent
extra = ToolCallExtraContent(google=Google(thought_signature="..."))
The field is fully optional - providers that don't use it ignore it entirely, and model_dump(exclude_none=True) strips it from the wire when unset.
Proxy Requests
To proxy a raw request directly to a provider's API through the gateway, use proxy_request:
response = client.proxy_request(
provider="openai",
path="/v1/models",
method="GET",
)
print("OpenAI models:", response)
Health Check
To check if the API is healthy:
if client.health_check():
print("API is healthy")
else:
print("API is unavailable")
Error Handling
The SDK provides several exception types:
from inference_gateway import (
InferenceGatewayError,
InferenceGatewayAPIError,
InferenceGatewayValidationError,
)
try:
response = client.create_chat_completion(...)
except InferenceGatewayAPIError as e:
print(f"API Error: {e} (Status: {e.status_code})")
print("Response:", e.response_data)
except InferenceGatewayValidationError as e:
print(f"Validation Error: {e}")
except InferenceGatewayError as e:
print(f"General Error: {e}")
Examples
For more detailed examples and use cases, check out the examples directory. The examples include:
- List Example - How to list available models
- Chat Example - Basic and advanced chat completion examples
- Tools Example - Function calling and tool usage
- MCP Example - Model Context Protocol integration examples
Each example includes its own README with specific instructions and explanations.
Supported Providers
The SDK supports the following LLM providers:
- Ollama (
"ollama") - Ollama Cloud (
"ollama_cloud") - Groq (
"groq") - OpenAI (
"openai") - DeepSeek (
"deepseek") - Cloudflare (
"cloudflare") - Cohere (
"cohere") - Anthropic (
"anthropic") - Google (
"google") - Mistral AI (
"mistral") - Moonshot (
"moonshot")
License
This SDK is distributed under the Apache 2.0 License, see LICENSE for more information.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file inference_gateway-0.7.1.tar.gz.
File metadata
- Download URL: inference_gateway-0.7.1.tar.gz
- Upload date:
- Size: 31.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
55fdb9dcaffd3fd05bea5193eaf656ea4c36b6ba01698316691fa202a50e19a9
|
|
| MD5 |
9e22e874596276d3783e80b2d7f76b7e
|
|
| BLAKE2b-256 |
d0ab8b54ba13da4646f3829680fe8c4f9538139674ff29d61b6fdce45c07a220
|
Provenance
The following attestation bundles were made for inference_gateway-0.7.1.tar.gz:
Publisher:
release.yml on inference-gateway/python-sdk
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
inference_gateway-0.7.1.tar.gz -
Subject digest:
55fdb9dcaffd3fd05bea5193eaf656ea4c36b6ba01698316691fa202a50e19a9 - Sigstore transparency entry: 1740303290
- Sigstore integration time:
-
Permalink:
inference-gateway/python-sdk@7f954d3cf10c2c543b16c4dd1fd78689e08e04af -
Branch / Tag:
refs/heads/main - Owner: https://github.com/inference-gateway
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@7f954d3cf10c2c543b16c4dd1fd78689e08e04af -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file inference_gateway-0.7.1-py3-none-any.whl.
File metadata
- Download URL: inference_gateway-0.7.1-py3-none-any.whl
- Upload date:
- Size: 23.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
622dd646d8e2ebe5f9e46f95a6b238cf32c516341f52814454acaff6018dd490
|
|
| MD5 |
ee8047a2b2ef8e71f20a53533ab5d8ca
|
|
| BLAKE2b-256 |
719d99097a5d052a51630ae8958956d676a8a792a6bfc2d87dceb8d73a1d5b7b
|
Provenance
The following attestation bundles were made for inference_gateway-0.7.1-py3-none-any.whl:
Publisher:
release.yml on inference-gateway/python-sdk
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
inference_gateway-0.7.1-py3-none-any.whl -
Subject digest:
622dd646d8e2ebe5f9e46f95a6b238cf32c516341f52814454acaff6018dd490 - Sigstore transparency entry: 1740303339
- Sigstore integration time:
-
Permalink:
inference-gateway/python-sdk@7f954d3cf10c2c543b16c4dd1fd78689e08e04af -
Branch / Tag:
refs/heads/main - Owner: https://github.com/inference-gateway
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@7f954d3cf10c2c543b16c4dd1fd78689e08e04af -
Trigger Event:
workflow_dispatch
-
Statement type: