Autourgos LLM wrapper for the OpenAI Responses API

autourgos-responses

LLM wrapper for the OpenAI Responses API (client.responses.create), part of the Autourgos framework.

Fully self-contained: no autourgos-core dependency is required. The only runtime dependency is the openai package.

The Responses API is OpenAI's newer, stateful endpoint that supports reasoning models (o3, o3-mini, o1), built-in tools, and multi-turn input natively.


Why use this?

Almost every major LLM provider today — Groq, Together AI, Mistral, Perplexity, DeepSeek, Ollama, LM Studio, vLLM, Azure OpenAI — exposes an OpenAI-compatible API. This means they all accept the same request format.

autourgos-responses takes advantage of this. You set base_url to any provider's endpoint and model to whatever model they offer. One package, any LLM. You never have to learn a new SDK or rewrite your code when you switch providers.

The Responses API gives you extra power on top: native reasoning models (o3, o3-mini, o1) with configurable thinking effort, text verbosity control, and cleaner multi-turn conversation handling.

OpenAI (gpt-4o, o3, o3-mini, o1) ──────────┐
Groq (Llama, Mixtral, Gemma) ───────────────┤
Together AI (70B, 8x7B, ...) ───────────────┤  autourgos-responses
Mistral AI (mistral-large, ...) ────────────┤  (one interface)
DeepSeek (deepseek-chat, ...) ──────────────┤
Perplexity (sonar models) ──────────────────┤
Ollama — any local model ───────────────────┤
LM Studio — any local model ────────────────┤
vLLM — self-hosted ─────────────────────────┤
Azure OpenAI ───────────────────────────────┘

Install

pip install autourgos-responses

Requires Python 3.10+ and openai>=1.0.0.


Works With Any LLM

All you need to switch providers is base_url and the right model name. Your API key comes from the provider you choose.

OpenAI (default)

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    api_key="sk-...",           # or set OPENAI_API_KEY env var
)
reply = llm.invoke("What is the capital of France?")
print(reply)
# Paris

OpenAI reasoning models

OpenAI's reasoning models accept reasoning_effort, which controls how long the model thinks before answering.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="o3-mini",
    api_key="sk-...",
    reasoning_effort="high",   # "low", "medium", or "high"
)
reply = llm.invoke("Prove that the square root of 2 is irrational.")
print(reply)
# Assume for contradiction that √2 = p/q in lowest terms...

Groq — fastest inference, free tier available

Groq runs open-source models (Llama 3, Mixtral, Gemma) at extremely high speed. Get your key at https://console.groq.com.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="llama3-70b-8192",
    api_key="gsk_...",          # Groq API key
    base_url="https://api.groq.com/openai/v1",
)
reply = llm.invoke("Explain quantum entanglement simply.")
print(reply)
# Quantum entanglement is when two particles become linked so that
# the state of one instantly affects the other, no matter how far apart they are.

Other Groq models: llama3-8b-8192, mixtral-8x7b-32768, gemma2-9b-it

Together AI — wide model selection

Together AI hosts hundreds of open-source models. Get your key at https://api.together.xyz.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="meta-llama/Llama-3-70b-chat-hf",
    api_key="...",              # Together AI key
    base_url="https://api.together.xyz/v1",
)
reply = llm.invoke("Write a Python function to check if a number is prime.")
print(reply)
# def is_prime(n: int) -> bool:
#     if n < 2:
#         return False
#     for i in range(2, int(n**0.5) + 1):
#         if n % i == 0:
#             return False
#     return True

Other Together AI models: mistralai/Mixtral-8x7B-Instruct-v0.1, Qwen/Qwen2-72B-Instruct

Mistral AI

Get your key at https://console.mistral.ai.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="mistral-large-latest",
    api_key="...",              # Mistral API key
    base_url="https://api.mistral.ai/v1",
)
reply = llm.invoke("What are the benefits of test-driven development?")
print(reply)
# TDD helps you write cleaner code, catch bugs early, and gives
# you confidence to refactor without breaking existing behaviour.

Other Mistral models: mistral-medium-latest, mistral-small-latest, open-mixtral-8x7b

DeepSeek

Get your key at https://platform.deepseek.com.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="deepseek-chat",
    api_key="...",              # DeepSeek API key
    base_url="https://api.deepseek.com/v1",
)
reply = llm.invoke("What is a transformer neural network?")
print(reply)
# A transformer is a neural network architecture that uses self-attention
# to process input sequences in parallel, making it highly effective for
# NLP tasks like translation, summarisation, and text generation.

Other DeepSeek models: deepseek-reasoner

Perplexity — web-connected models

Perplexity's Sonar models can search the web in real time. Get your key at https://www.perplexity.ai/settings/api.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="llama-3.1-sonar-large-128k-online",
    api_key="pplx-...",        # Perplexity API key
    base_url="https://api.perplexity.ai",
)
reply = llm.invoke("What are the top AI news stories today?")
print(reply)
# Today's top AI stories include...

Ollama — run any model locally, no internet needed

Ollama runs models entirely on your machine. Install from https://ollama.com, then pull a model:

ollama pull llama3

No API key needed for local use.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="llama3",
    api_key="ollama",           # can be any string — Ollama ignores it
    base_url="http://localhost:11434/v1",
)
reply = llm.invoke("What is the difference between RAM and ROM?")
print(reply)
# RAM (Random Access Memory) is fast, temporary storage your computer uses
# while running programs. ROM (Read-Only Memory) is permanent storage that
# holds firmware your computer needs to boot up.

Other Ollama models: mistral, phi3, gemma2, codellama, qwen2 — anything you pull with ollama pull.

LM Studio — local models with a GUI

LM Studio lets you download and run GGUF models locally. Start the local server in LM Studio, then:

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="local-model",        # use whatever model name LM Studio shows
    api_key="lm-studio",        # any string — ignored locally
    base_url="http://localhost:1234/v1",
)
reply = llm.invoke("Explain recursion in simple terms.")
print(reply)
# Recursion is when a function calls itself to solve a smaller version
# of the same problem, until it reaches a base case that stops the loop.

vLLM — self-hosted high-throughput serving

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key="EMPTY",            # vLLM default when no auth is set
    base_url="http://your-server:8000/v1",
)
reply = llm.invoke("What is the capital of Japan?")
print(reply)
# Tokyo

Azure OpenAI

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",             # your deployment name in Azure
    api_key="...",              # Azure OpenAI key
    base_url="https://<your-resource>.openai.azure.com/openai/deployments/gpt-4o",
)
reply = llm.invoke("What is cloud computing?")
print(reply)
# Cloud computing is the delivery of computing services over the internet
# on a pay-as-you-go basis.

Switching providers at runtime

from autourgos_responses import OpenAIResponse

PROVIDERS = {
    "openai": {
        "model": "gpt-4o-mini",
        "api_key": "sk-...",
        "base_url": None,
    },
    "groq": {
        "model": "llama3-8b-8192",
        "api_key": "gsk_...",
        "base_url": "https://api.groq.com/openai/v1",
    },
    "ollama": {
        "model": "llama3",
        "api_key": "ollama",
        "base_url": "http://localhost:11434/v1",
    },
}

for name, cfg in PROVIDERS.items():
    llm = OpenAIResponse(**cfg)
    reply = llm.invoke("Say hello in one word.")
    print(f"{name}: {reply}")

# openai: Hello!
# groq:   Hello!
# ollama: Hello!
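
The provider can also be chosen from the environment instead of being hard-coded. The following is a minimal sketch, not part of the package: the LLM_PROVIDER variable, the PROVIDER_CONFIGS registry, and the select_provider helper are all hypothetical names introduced for illustration.

```python
import os

# Hypothetical registry in the same shape as the PROVIDERS dict (API keys omitted).
PROVIDER_CONFIGS = {
    "openai": {"model": "gpt-4o-mini", "base_url": None},
    "groq": {"model": "llama3-8b-8192", "base_url": "https://api.groq.com/openai/v1"},
    "ollama": {"model": "llama3", "base_url": "http://localhost:11434/v1"},
}


def select_provider(name=None):
    """Return the config for `name`, else for $LLM_PROVIDER, else for "openai"."""
    chosen = (name or os.environ.get("LLM_PROVIDER", "openai")).lower()
    if chosen not in PROVIDER_CONFIGS:
        known = ", ".join(sorted(PROVIDER_CONFIGS))
        raise ValueError(f"Unknown provider '{chosen}'. Known providers: {known}")
    # Return a copy so callers can add an api_key without mutating the registry.
    return dict(PROVIDER_CONFIGS[chosen])


cfg = select_provider("groq")
print(cfg["base_url"])  # https://api.groq.com/openai/v1
```

The returned dict could then be passed on as OpenAIResponse(api_key=..., **cfg).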

Quick Start

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")
reply = llm.invoke("What is the capital of France?")
print(reply)
# Paris

Basic Text Generation

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    api_key="sk-...",        # or set OPENAI_API_KEY env var
    temperature=0.7,
    max_tokens=256,
)

reply = llm.invoke("Explain machine learning in one sentence.")
print(reply)
# Machine learning is a branch of AI where systems learn from data
# to make predictions or decisions without being explicitly programmed.

Async Generation

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

async def main():
    reply = await llm.ainvoke("What is the speed of light?")
    print(reply)
    # The speed of light in a vacuum is approximately 299,792,458 metres per second.

asyncio.run(main())

Streaming

Stream the response token by token synchronously.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

for chunk in llm.stream("Write a haiku about mountains."):
    print(chunk, end="", flush=True)

# Silent peaks above,
# Clouds drift through the ancient stone,
# Eagles trace the wind.

You can also enable streaming at construction time so invoke() internally streams and returns the full joined text:

llm = OpenAIResponse(model="gpt-4o", streaming=True)
reply = llm.invoke("Tell me a fun fact about space.")
print(reply)
# A day on Venus is longer than a year on Venus — it takes 243 Earth days
# to rotate once but only 225 Earth days to orbit the Sun.
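
To make the streaming=True behaviour concrete, here is a small sketch of "stream internally, return the joined text". It uses a fake chunk generator in place of a real API call; fake_stream and invoke_streaming are illustrative names, and the package's actual internals may differ.

```python
from typing import Iterator


def fake_stream(text: str, size: int = 5) -> Iterator[str]:
    """Stand-in for llm.stream(): yields the text in small chunks."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def invoke_streaming(text: str) -> str:
    """Consume every chunk and return the joined text."""
    return "".join(fake_stream(text))


print(invoke_streaming("Silent peaks above"))  # Silent peaks above
```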

Async Streaming

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

async def main():
    async for chunk in llm.astream("Count prime numbers up to 20."):
        print(chunk, end="", flush=True)
    # 2, 3, 5, 7, 11, 13, 17, 19

asyncio.run(main())

Batch Invocation

Synchronous (sequential)

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o-mini")

prompts = [
    "Capital of Japan?",
    "Capital of Germany?",
    "Capital of Brazil?",
]

results = llm.batch_invoke(prompts)
for prompt, result in zip(prompts, results):
    print(f"{prompt} -> {result}")

# Capital of Japan?   -> Tokyo
# Capital of Germany? -> Berlin
# Capital of Brazil?  -> Brasilia

Async (concurrent)

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o-mini")

async def main():
    results = await llm.abatch_invoke([
        "Capital of Japan?",
        "Capital of Germany?",
        "Capital of Brazil?",
    ])
    print(results)
    # ['Tokyo', 'Berlin', 'Brasilia']

asyncio.run(main())
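
The source does not say how abatch_invoke runs its requests. One common way to get concurrent calls with results in prompt order is asyncio.gather, sketched below with a fake coroutine in place of a real request; fake_ainvoke and gather_in_order are hypothetical names.

```python
import asyncio


async def fake_ainvoke(prompt: str, delay: float) -> str:
    """Stand-in for an API call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"answer to: {prompt}"


async def gather_in_order(prompts):
    # Give later prompts shorter delays so they finish first.
    delays = [0.01 * (len(prompts) - i) for i in range(len(prompts))]
    calls = [fake_ainvoke(p, d) for p, d in zip(prompts, delays)]
    # gather runs the calls concurrently and returns results in input order.
    return await asyncio.gather(*calls)


results = asyncio.run(gather_in_order(["a", "b", "c"]))
print(results)  # ['answer to: a', 'answer to: b', 'answer to: c']
```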

System Instruction

Set a persistent system prompt sent as the instructions field of every request.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    system_instruction="You are a concise assistant. Always reply in exactly one sentence.",
)

reply = llm.invoke("What is photosynthesis?")
print(reply)
# Photosynthesis is the process by which plants use sunlight, water, and CO2
# to produce glucose and oxygen.

Prompt Templates

Define a reusable template with {placeholders} and fill them at call time.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    prompt_template="Summarise the following {topic} in {num_words} words:\n\n{content}",
)

reply = llm.invoke(prompt_variables={
    "topic": "article",
    "num_words": "30",
    "content": "Quantum computing uses quantum bits (qubits) that can exist in superposition...",
})
print(reply)
# Quantum computing uses qubits in superposition to perform many calculations
# simultaneously, offering vastly superior speeds for specific complex problems
# like cryptography and molecular simulation.

Missing variables raise a clear error:

llm.invoke(prompt_variables={"topic": "article"})
# ValueError: Missing prompt template variables: content, num_words
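
A check like this can be built with the standard library alone. The sketch below uses string.Formatter to find the placeholders and reproduces the error message shown above; render_template is a hypothetical helper, not the package's implementation.

```python
from string import Formatter


def render_template(template: str, variables: dict) -> str:
    """Fill {placeholders}, raising ValueError if any variable is missing."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    needed = {name for _, name, _, _ in Formatter().parse(template) if name}
    missing = sorted(needed - set(variables))
    if missing:
        raise ValueError("Missing prompt template variables: " + ", ".join(missing))
    return template.format(**variables)


template = "Summarise the following {topic} in {num_words} words:\n\n{content}"
try:
    render_template(template, {"topic": "article"})
except ValueError as exc:
    print(exc)  # Missing prompt template variables: content, num_words
```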

Reasoning Models

o3, o3-mini, and o1 are OpenAI's reasoning models. They support reasoning_effort to control how long the model thinks before answering. Higher effort produces better answers for hard problems but takes longer and costs more.

The reasoning_effort parameter applies to OpenAI's reasoning models. When using other providers, omit reasoning_effort.

reasoning_effort

Valid values: "low", "medium", "high".

from autourgos_responses import OpenAIResponse

# Low effort — fast, cheaper
llm = OpenAIResponse(model="o3-mini", reasoning_effort="low")
reply = llm.invoke("What is 17 × 23?")
print(reply)
# 391

# Medium effort — balanced
llm = OpenAIResponse(model="o3-mini", reasoning_effort="medium")
reply = llm.invoke("Solve: if a train travels at 80 km/h for 2.5 hours, how far does it go?")
print(reply)
# The train travels 200 km. (80 km/h × 2.5 h = 200 km)

# High effort — most thorough, best for hard problems
llm = OpenAIResponse(model="o3", reasoning_effort="high")
reply = llm.invoke(
    "Prove that the square root of 2 is irrational."
)
print(reply)
# Assume for contradiction that √2 = p/q where p and q are integers with no common factors...

When to use each level

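| effort   | Use for                                          | Speed     | Cost    |
|----------|--------------------------------------------------|-----------|---------|
| "low"    | Simple maths, factual Q&A, quick summaries       | Very fast | Lowest  |
| "medium" | Multi-step reasoning, code generation            | Moderate  | Medium  |
| "high"   | Hard proofs, complex analysis, frontier research | Slow      | Highest |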

Invalid effort raises immediately

OpenAIResponse(model="o3-mini", reasoning_effort="ultra")
# ValueError: Invalid reasoning_effort 'ultra'. Must be one of: ['high', 'low', 'medium']
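
Validation like this can run before any request is sent. The sketch below reproduces the error message above and shows how a valid effort maps onto the reasoning argument of client.responses.create; check_reasoning_effort and reasoning_kwargs are hypothetical helper names, not the package's API.

```python
VALID_EFFORTS = {"low", "medium", "high"}


def check_reasoning_effort(effort):
    """Return effort unchanged if it is valid or None; otherwise raise ValueError."""
    if effort is not None and effort not in VALID_EFFORTS:
        raise ValueError(
            f"Invalid reasoning_effort '{effort}'. "
            f"Must be one of: {sorted(VALID_EFFORTS)}"
        )
    return effort


def reasoning_kwargs(effort):
    """Build the `reasoning` request argument, or an empty dict if no effort is set."""
    effort = check_reasoning_effort(effort)
    return {"reasoning": {"effort": effort}} if effort else {}


print(reasoning_kwargs("high"))  # {'reasoning': {'effort': 'high'}}
```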

Multi-Modal Vision Input

Pass image files, URLs, or raw bytes alongside text.

Note: vision support depends on the provider and model. GPT-4o, LLaVA (Ollama), and several others support it.

From a file path

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")
reply = llm.invoke("What objects are in this image?", files=["photo.jpg"])
print(reply)
# The image shows a wooden desk with a laptop, a coffee mug, and an open notebook.

From a URL

reply = llm.invoke(
    "Describe this chart in detail.",
    files=["https://example.com/sales-chart.png"],
)
print(reply)
# The chart is a bar graph comparing quarterly revenue across four product lines.
# Q3 shows the highest sales at approximately $2.4M for Product A...

From raw bytes

with open("diagram.png", "rb") as f:
    image_bytes = f.read()

reply = llm.invoke("Explain this architecture diagram.", files=[image_bytes])
print(reply)
# The diagram shows a microservices architecture with an API gateway at the top
# routing requests to three downstream services: Auth, Orders, and Payments...

Multiple images

reply = llm.invoke(
    "Which image shows more people?",
    files=["crowd1.jpg", "crowd2.jpg"],
)
print(reply)
# The first image shows more people — it appears to be a large outdoor concert
# with thousands of attendees, while the second shows a small group of around 20.
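
How the files argument is converted internally is not documented here. As a rough sketch, the three input kinds could be normalised into Responses API input_image parts as follows, with raw bytes and local files base64-encoded as data URLs. The to_image_part helper and the image/png default for raw bytes are assumptions.

```python
import base64
import mimetypes


def to_image_part(source, default_mime="image/png"):
    """Turn bytes, a URL, or a file path into an input_image content part."""
    if isinstance(source, bytes):
        # Raw bytes carry no type information, so fall back to default_mime.
        encoded = base64.b64encode(source).decode("ascii")
        return {"type": "input_image", "image_url": f"data:{default_mime};base64,{encoded}"}
    if source.startswith(("http://", "https://")):
        # Remote images are passed by URL unchanged.
        return {"type": "input_image", "image_url": source}
    # Otherwise treat the string as a local path and guess the MIME type from its suffix.
    mime = mimetypes.guess_type(source)[0] or default_mime
    with open(source, "rb") as fh:
        encoded = base64.b64encode(fh.read()).decode("ascii")
    return {"type": "input_image", "image_url": f"data:{mime};base64,{encoded}"}


print(to_image_part(b"abc")["image_url"])  # data:image/png;base64,YWJj
```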

Structured Output

Return data that matches a Pydantic model automatically.

from pydantic import BaseModel, Field
from autourgos_responses import OpenAIResponse

class WeatherReport(BaseModel):
    city: str = Field(description="Name of the city")
    temperature_celsius: float = Field(description="Current temperature in Celsius")
    condition: str = Field(description="Weather condition e.g. Sunny, Rainy")
    humidity_percent: int = Field(description="Humidity percentage 0-100")

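# structured_output=True makes invoke() return the metadata dict used below
llm = OpenAIResponse(model="gpt-4o", response_schema=WeatherReport, structured_output=True)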
result = llm.invoke("Describe a typical summer day in London.")

import json
data = json.loads(result["response"])
print(data)
# {
#   "city": "London",
#   "temperature_celsius": 22.0,
#   "condition": "Partly Cloudy",
#   "humidity_percent": 65
# }

Use a plain dict schema instead of Pydantic:

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age":  {"type": "integer"},
    },
    "required": ["name", "age"],
}

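# structured_output=True makes invoke() return the metadata dict used below
llm = OpenAIResponse(model="gpt-4o", response_schema=schema, structured_output=True)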
result = llm.invoke("Invent a fictional person.")
print(result["response"])
# {"name": "Mira Caldwell", "age": 34}
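
Whether the wrapper validates the returned JSON against the schema is not stated here, so a defensive check after parsing may be worthwhile. A minimal standard-library sketch follows; parse_structured is a hypothetical helper that only checks the schema's "required" keys.

```python
import json


def parse_structured(raw_text: str, schema: dict) -> dict:
    """Parse JSON text and confirm every key listed under "required" is present."""
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"Response is missing required keys: {missing}")
    return data


person_schema = {"type": "object", "required": ["name", "age"]}
person = parse_structured('{"name": "Mira Caldwell", "age": 34}', person_schema)
print(person["age"])  # 34
```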

JSON Mode

Force valid JSON output without defining a schema.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    response_mime_type="application/json",
    system_instruction="Always respond with valid JSON only.",
)

reply = llm.invoke("List three programming languages with their year of creation.")
print(reply)
# {
#   "languages": [
#     {"name": "Python",     "year": 1991},
#     {"name": "JavaScript", "year": 1995},
#     {"name": "Rust",       "year": 2010}
#   ]
# }

Multi-Turn Chat

Pass a list of role-tagged messages directly to carry conversation history.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

messages = [
    {"role": "user",      "content": "My favourite colour is blue."},
    {"role": "assistant", "content": "That is a great choice! Blue is calming and versatile."},
    {"role": "user",      "content": "What is my favourite colour?"},
]

reply = llm.chat(messages)
print(reply)
# Your favourite colour is blue!

Async multi-turn

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

async def main():
    messages = [
        {"role": "user",      "content": "I work as a data scientist."},
        {"role": "assistant", "content": "That is a fascinating field!"},
        {"role": "user",      "content": "What is my job?"},
    ]
    reply = await llm.achat(messages)
    print(reply)
    # You work as a data scientist.

asyncio.run(main())

Building a conversation loop

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")
history = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = llm.chat(history)
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Jitin."))
# Nice to meet you, Jitin!

print(chat("I am building an AI framework called Autourgos."))
# That sounds exciting! What does Autourgos focus on?

print(chat("What is my name and what am I building?"))
# Your name is Jitin, and you are building an AI framework called Autourgos.

Cost Tracking

Pass pricing (USD per 1 million tokens) to get cost breakdowns.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    input_pricing=2.50,    # $2.50 per 1M input tokens
    output_pricing=10.00,  # $10.00 per 1M output tokens
    structured_output=True,
)

result = llm.invoke("Summarise the history of the internet in 3 sentences.")
print(result["model"])          # gpt-4o
print(result["response"])       # The internet began as ARPANET in the 1960s...
print(result["input_tokens"])   # 21
print(result["output_tokens"])  # 68
print(result["total_tokens"])   # 89
print(result["input_cost"])     # 0.0000525
print(result["output_cost"])    # 0.00068
print(result["total_cost"])     # 0.0007325
print(result["latency_ms"])     # 1102.4

Access the last call metadata without structured_output=True:

llm = OpenAIResponse(model="gpt-4o", input_pricing=2.50, output_pricing=10.00)
reply = llm.invoke("Hello!")
print(llm.last_metadata)
# {
#   "model": "gpt-4o",
#   "response": "Hello! How can I help you today?",
#   "input_tokens": 9,
#   "output_tokens": 10,
#   "total_tokens": 19,
#   "input_cost": 0.0000225,
#   "output_cost": 0.0001,
#   "total_cost": 0.0001225,
#   "latency_ms": 921.7
# }
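
The cost fields follow directly from the per-million pricing: cost = tokens × price / 1,000,000. A short sketch that reproduces the figures from the gpt-4o example above (token_cost is an illustrative helper, not part of the package):

```python
def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for `tokens` tokens at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


input_cost = token_cost(21, 2.50)    # 21 input tokens at $2.50 per 1M
output_cost = token_cost(68, 10.00)  # 68 output tokens at $10.00 per 1M
total_cost = input_cost + output_cost
print(f"{input_cost:.7f} {output_cost:.7f} {total_cost:.7f}")
# 0.0000525 0.0006800 0.0007325
```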

Context Manager

Automatically closes the HTTP client when done.

from autourgos_responses import OpenAIResponse

with OpenAIResponse(model="gpt-4o") as llm:
    reply = llm.invoke("Quick question: what is 2 + 2?")
    print(reply)
    # 4
# Client is closed automatically here

Async context manager:

import asyncio
from autourgos_responses import OpenAIResponse

async def main():
    async with OpenAIResponse(model="gpt-4o") as llm:
        reply = await llm.ainvoke("What year did the Berlin Wall fall?")
        print(reply)
        # The Berlin Wall fell in 1989.

asyncio.run(main())

Circuit Breaker

Protects against cascading failures. After circuit_failure_threshold consecutive API errors, the circuit opens and all calls are blocked for circuit_cooldown_time seconds.

This is useful when you are using a local model (Ollama, LM Studio) or a rate-limited API — if the server goes down, the circuit breaker stops your code from hammering it with failed requests.

from autourgos_responses import OpenAIResponse, CircuitBreakerOpenException

llm = OpenAIResponse(
    model="gpt-4o",
    circuit_failure_threshold=3,   # open after 3 consecutive failures
    circuit_cooldown_time=60.0,    # block calls for 60 seconds
)

try:
    reply = llm.invoke("Hello!")
    print(reply)
except CircuitBreakerOpenException as e:
    print(f"Circuit is open, skipping call: {e}")
    # Circuit breaker OPEN for OpenAIResponse — 3 consecutive failures.
    # Blocked until 1718500060.0.

After the cooldown expires, the next call is allowed through as a probe. If it succeeds, the circuit resets to closed. If it fails again, the cooldown restarts.
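
The open, probe, and reset cycle described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the package's implementation: SimpleCircuitBreaker and its method names are hypothetical, and it uses a monotonic clock that can be swapped out for testing.

```python
import time


class SimpleCircuitBreaker:
    """Opens after N consecutive failures; allows a probe once the cooldown ends."""

    def __init__(self, failure_threshold=5, cooldown_time=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_time = cooldown_time
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: only let a probe through once the cooldown has elapsed.
        return self._clock() - self._opened_at >= self.cooldown_time

    def record_success(self) -> None:
        # Any success resets the circuit to closed.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open the circuit, or restart the cooldown after a failed probe.
            self._opened_at = self._clock()
```

A caller would check allow_request() before each API call, then report the outcome with record_success() or record_failure().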


Low-Level Access

Direct access to the raw Responses API object when you need full control.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

raw = llm.create("Explain gravity briefly.")
print(raw.output_text)
print(raw.usage.input_tokens)
print(raw.usage.output_tokens)

Async:

# Run inside an async function, e.g. one started with asyncio.run()
raw = await llm.acreate("Explain gravity briefly.")
print(raw.output_text)

With overrides:

raw = llm.create(
    "Summarise this.",
    temperature=0.3,
    max_output_tokens=50,
)

Error Handling

from autourgos_responses import (
    OpenAIResponse,
    OpenAIResponseAPIError,
    OpenAIResponseResponseError,
    OpenAIResponseConfigError,
    OpenAIResponseImportError,
    CircuitBreakerOpenException,
)

llm = OpenAIResponse(model="gpt-4o")

try:
    reply = llm.invoke("Hello!")
    print(reply)
except OpenAIResponseAPIError as e:
    # All retries exhausted — network issue or rate limit
    print(f"API error after retries: {e}")
except OpenAIResponseResponseError as e:
    # Response was received but no text could be extracted
    print(f"Could not parse response: {e}")
except OpenAIResponseConfigError as e:
    # Incompatible options e.g. streaming=True + structured_output=True
    print(f"Configuration error: {e}")
except OpenAIResponseImportError as e:
    # openai package is not installed
    print(f"openai not installed: {e}")
except CircuitBreakerOpenException as e:
    # Too many recent failures — circuit is open
    print(f"Circuit breaker is open: {e}")

Retry behaviour

By default the wrapper retries up to 3 times with exponential back-off:

| Attempt     | Wait before retry             |
|-------------|-------------------------------|
| 1st failure | 0.5 s                         |
| 2nd failure | 1.0 s                         |
| 3rd failure | 2.0 s                         |
| 4th failure | raises OpenAIResponseAPIError |

Change with max_retries and backoff_factor:

llm = OpenAIResponse(
    model="gpt-4o",
    max_retries=5,
    backoff_factor=1.0,   # waits: 1s, 2s, 4s, 8s, 16s then raises
)
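
The wait schedule follows from wait = backoff_factor × 2^attempt, with attempt starting at 0. The sketch below computes the schedule and wraps a call in a retry loop. backoff_waits and call_with_retries are hypothetical helpers: the real wrapper raises OpenAIResponseAPIError once retries are exhausted, whereas this sketch simply re-raises the last exception.

```python
import time


def backoff_waits(max_retries=3, backoff_factor=0.5):
    """Wait before each retry: backoff_factor * 2**attempt, attempt = 0, 1, ..."""
    return [backoff_factor * 2 ** attempt for attempt in range(max_retries)]


def call_with_retries(fn, max_retries=3, backoff_factor=0.5, sleep=time.sleep):
    """Call fn(); on failure wait and retry, re-raising after the final retry."""
    waits = backoff_waits(max_retries, backoff_factor)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # all retries used up
            sleep(waits[attempt])


print(backoff_waits())  # [0.5, 1.0, 2.0]
```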

Constructor Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | str | required | Model name, e.g. "gpt-4o", "o3-mini", "llama3-70b-8192" |
| api_key | str | OPENAI_API_KEY env | API key for the provider you are using |
| base_url | str | OPENAI_BASE_URL env | Provider endpoint, e.g. "https://api.groq.com/openai/v1" or "http://localhost:11434/v1" |
| organization | str | None | OpenAI organization ID (OpenAI only) |
| project | str | None | OpenAI project ID (OpenAI only) |
| system_instruction | str | None | System prompt sent as the instructions field |
| prompt_template | str | None | Template with {variable} placeholders |
| temperature | float | None | Sampling temperature, 0–2 |
| top_p | float | None | Nucleus sampling, 0–1 |
| max_tokens | int | None | Maximum output tokens (maps to max_output_tokens) |
| reasoning_effort | str | None | "low", "medium", or "high"; for o3, o3-mini, o1 only |
| reasoning_summary | str | None | Include reasoning summary in output (OpenAI only) |
| text_verbosity | str | None | "concise", "detailed", or "auto" |
| response_schema | BaseModel / dict | None | Pydantic model or JSON schema for structured output |
| response_mime_type | str | None | "application/json" enables JSON object mode |
| structured_output | bool | False | If True, invoke() returns a metadata dict |
| streaming | bool | False | If True, invoke() streams internally and joins the chunks |
| max_retries | int | 3 | Retry attempts on transient API errors |
| timeout | float | 60.0 | Request timeout in seconds |
| backoff_factor | float | 0.5 | Exponential back-off base (wait = factor × 2^attempt) |
| input_pricing | float | None | USD per 1 million input tokens |
| output_pricing | float | None | USD per 1 million output tokens |
| circuit_failure_threshold | int | 5 | Consecutive failures before the circuit opens |
| circuit_cooldown_time | float | 30.0 | Seconds the circuit stays open before probing |

What Each Method Returns

| Method | Returns |
|--------|---------|
| invoke(prompt) | str: generated text (or dict if structured_output=True) |
| ainvoke(prompt) | same as invoke, async |
| stream(prompt) | Iterator[str]: text chunks |
| astream(prompt) | AsyncIterator[str]: text chunks |
| batch_invoke(prompts) | list[str]: one result per prompt, sequential |
| abatch_invoke(prompts) | list[str]: concurrent results |
| chat(messages) | str: generated text (or dict if structured_output=True) |
| achat(messages) | same as chat, async |
| create(input_data) | Raw OpenAI Responses API response object |
| acreate(input_data) | same as create, async |

Metadata dict keys (when structured_output=True or via llm.last_metadata)

| Key | Type | Description |
|-----|------|-------------|
| "model" | str | Model name used |
| "response" | str | Generated text |
| "input_tokens" | int or None | Input token count |
| "output_tokens" | int or None | Output token count |
| "total_tokens" | int or None | Total token count |
| "input_cost" | float | Input cost in USD (only if input_pricing set) |
| "output_cost" | float | Output cost in USD (only if output_pricing set) |
| "total_cost" | float | Total cost in USD (only if both pricing values set) |
| "latency_ms" | float | Request round-trip time in milliseconds |

Supported Providers (quick reference)

| Provider | base_url | Notes |
|----------|----------|-------|
| OpenAI | (default) | GPT-4o, o3, o3-mini, o1, GPT-4o-mini |
| Groq | https://api.groq.com/openai/v1 | Llama 3, Mixtral, Gemma; very fast |
| Together AI | https://api.together.xyz/v1 | 100+ open-source models |
| Mistral AI | https://api.mistral.ai/v1 | mistral-large, mixtral, codestral |
| DeepSeek | https://api.deepseek.com/v1 | deepseek-chat, deepseek-reasoner |
| Perplexity | https://api.perplexity.ai | Web-connected sonar models |
| Ollama | http://localhost:11434/v1 | Runs locally, no API key needed |
| LM Studio | http://localhost:1234/v1 | Runs locally, GUI-based |
| vLLM | http://your-server:8000/v1 | Self-hosted, high throughput |
| Azure OpenAI | https://<resource>.openai.azure.com/... | Enterprise OpenAI |

Differences vs autourgos-openaichat

| Feature | autourgos-openaichat | autourgos-responses |
|---------|----------------------|---------------------|
| API endpoint | chat.completions.create | responses.create |
| System prompt field | messages[0].role = "system" | instructions parameter |
| Reasoning models | Not supported | reasoning_effort param for o3/o1 |
| Text verbosity control | Not supported | text_verbosity param |
| Multi-turn input | Messages list | Messages list or plain string |
| Native tool calling | Supported | Not yet supported by this package |
| Use when | Building chat agents, tool-calling | Using reasoning models, simple generation |

Both packages support the same providers via base_url. Choose based on the API endpoint your use case needs.


License

MIT — Copyright (c) 2026 Jitin Kumar Sengar
