Autourgos LLM wrapper for the OpenAI Responses API

autourgos-responses

LLM wrapper for the OpenAI Responses API (client.responses.create), part of the Autourgos framework.

Fully self-contained: no autourgos-core dependency is required. The only runtime requirement is the openai package.

The Responses API is OpenAI's newer, stateful endpoint that supports reasoning models (o3, o3-mini, o1), built-in tools, and multi-turn input natively.

Why use this?

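Almost every major LLM provider today (Groq, Together AI, Mistral, Perplexity, DeepSeek, Ollama, LM Studio, vLLM, Azure OpenAI) exposes an OpenAI-compatible API, so they all accept the same request format. Because this package calls the Responses endpoint specifically, confirm that your provider serves that route; some OpenAI-compatible servers implement only Chat Completions.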

autourgos-responses takes advantage of this. You set base_url to any provider's endpoint and model to whatever model they offer. One package, any LLM. You never have to learn a new SDK or rewrite your code when you switch providers.

The Responses API gives you extra power on top: native reasoning models (o3, o3-mini, o1) with configurable thinking effort, text verbosity control, and cleaner multi-turn conversation handling.

OpenAI (gpt-4o, o3, o3-mini, o1) ──────────┐
Groq (Llama, Mixtral, Gemma) ───────────────┤
Together AI (70B, 8x7B, ...) ───────────────┤  autourgos-responses
Mistral AI (mistral-large, ...) ────────────┤  (one interface)
DeepSeek (deepseek-chat, ...) ──────────────┤
Perplexity (sonar models) ──────────────────┤
Ollama — any local model ───────────────────┤
LM Studio — any local model ────────────────┤
vLLM — self-hosted ─────────────────────────┤
Azure OpenAI ───────────────────────────────┘

Table of Contents

Install
Works With Any LLM
Quick Start
Basic Text Generation
Async Generation
Streaming
Async Streaming
Batch Invocation
System Instruction
Prompt Templates
Reasoning Models
Multi-Modal Vision Input
Structured Output
JSON Mode
Multi-Turn Chat
Cost Tracking
Context Manager
Circuit Breaker
Low-Level Access
Error Handling
Constructor Reference
What Each Method Returns
Differences vs autourgos-openaichat

Install

pip install autourgos-responses

Requires Python 3.10+ and openai>=1.0.0.

Works With Any LLM

All you need to switch providers is base_url and the right model name. Your API key comes from the provider you choose.

OpenAI (default)

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    api_key="sk-...",           # or set OPENAI_API_KEY env var
)
reply = llm.invoke("What is the capital of France?")
print(reply)
# Paris

OpenAI reasoning models

These models are specific to OpenAI. They support reasoning_effort, which controls how long the model thinks before answering.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="o3-mini",
    api_key="sk-...",
    reasoning_effort="high",   # "low", "medium", or "high"
)
reply = llm.invoke("Prove that the square root of 2 is irrational.")
print(reply)
# Assume for contradiction that √2 = p/q in lowest terms...

Groq — fastest inference, free tier available

Groq runs open-source models (Llama 3, Mixtral, Gemma) at extremely high speed. Get your key at https://console.groq.com.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="llama3-70b-8192",
    api_key="gsk_...",          # Groq API key
    base_url="https://api.groq.com/openai/v1",
)
reply = llm.invoke("Explain quantum entanglement simply.")
print(reply)
# Quantum entanglement is when two particles become linked so that
# the state of one instantly affects the other, no matter how far apart they are.

Other Groq models: llama3-8b-8192, mixtral-8x7b-32768, gemma2-9b-it

Together AI — wide model selection

Together AI hosts hundreds of open-source models. Get your key at https://api.together.xyz.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="meta-llama/Llama-3-70b-chat-hf",
    api_key="...",              # Together AI key
    base_url="https://api.together.xyz/v1",
)
reply = llm.invoke("Write a Python function to check if a number is prime.")
print(reply)
# def is_prime(n: int) -> bool:
#     if n < 2:
#         return False
#     for i in range(2, int(n**0.5) + 1):
#         if n % i == 0:
#             return False
#     return True

Other Together AI models: mistralai/Mixtral-8x7B-Instruct-v0.1, Qwen/Qwen2-72B-Instruct

Mistral AI

Get your key at https://console.mistral.ai.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="mistral-large-latest",
    api_key="...",              # Mistral API key
    base_url="https://api.mistral.ai/v1",
)
reply = llm.invoke("What are the benefits of test-driven development?")
print(reply)
# TDD helps you write cleaner code, catch bugs early, and gives
# you confidence to refactor without breaking existing behaviour.

Other Mistral models: mistral-medium-latest, mistral-small-latest, open-mixtral-8x7b

DeepSeek

Get your key at https://platform.deepseek.com.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="deepseek-chat",
    api_key="...",              # DeepSeek API key
    base_url="https://api.deepseek.com/v1",
)
reply = llm.invoke("What is a transformer neural network?")
print(reply)
# A transformer is a neural network architecture that uses self-attention
# to process input sequences in parallel, making it highly effective for
# NLP tasks like translation, summarisation, and text generation.

Other DeepSeek models: deepseek-reasoner

Perplexity — web-connected models

Perplexity's Sonar models can search the web in real time. Get your key at https://www.perplexity.ai/settings/api.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="llama-3.1-sonar-large-128k-online",
    api_key="pplx-...",        # Perplexity API key
    base_url="https://api.perplexity.ai",
)
reply = llm.invoke("What are the top AI news stories today?")
print(reply)
# Today's top AI stories include...

Ollama — run any model locally, no internet needed

Ollama runs models entirely on your machine. Install from https://ollama.com, then pull a model:

ollama pull llama3

No API key needed for local use.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="llama3",
    api_key="ollama",           # can be any string — Ollama ignores it
    base_url="http://localhost:11434/v1",
)
reply = llm.invoke("What is the difference between RAM and ROM?")
print(reply)
# RAM (Random Access Memory) is fast, temporary storage your computer uses
# while running programs. ROM (Read-Only Memory) is permanent storage that
# holds firmware your computer needs to boot up.

Other Ollama models: mistral, phi3, gemma2, codellama, qwen2 — anything you pull with ollama pull.

LM Studio — local models with a GUI

LM Studio lets you download and run GGUF models locally. Start the local server in LM Studio, then:

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="local-model",        # use whatever model name LM Studio shows
    api_key="lm-studio",        # any string — ignored locally
    base_url="http://localhost:1234/v1",
)
reply = llm.invoke("Explain recursion in simple terms.")
print(reply)
# Recursion is when a function calls itself to solve a smaller version
# of the same problem, until it reaches a base case that stops the loop.

vLLM — self-hosted high-throughput serving

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key="EMPTY",            # vLLM default when no auth is set
    base_url="http://your-server:8000/v1",
)
reply = llm.invoke("What is the capital of Japan?")
print(reply)
# Tokyo

Azure OpenAI

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",             # your deployment name in Azure
    api_key="...",              # Azure OpenAI key
    base_url="https://<your-resource>.openai.azure.com/openai/deployments/gpt-4o",
)
reply = llm.invoke("What is cloud computing?")
print(reply)
# Cloud computing is the delivery of computing services over the internet
# on a pay-as-you-go basis.

Switching providers at runtime

from autourgos_responses import OpenAIResponse

PROVIDERS = {
    "openai": {
        "model": "gpt-4o-mini",
        "api_key": "sk-...",
        "base_url": None,
    },
    "groq": {
        "model": "llama3-8b-8192",
        "api_key": "gsk_...",
        "base_url": "https://api.groq.com/openai/v1",
    },
    "ollama": {
        "model": "llama3",
        "api_key": "ollama",
        "base_url": "http://localhost:11434/v1",
    },
}

for name, cfg in PROVIDERS.items():
    llm = OpenAIResponse(**cfg)
    reply = llm.invoke("Say hello in one word.")
    print(f"{name}: {reply}")

# openai: Hello!
# groq:   Hello!
# ollama: Hello!
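
To keep API keys out of source code, the same kind of table can be filled from environment variables. The following is a minimal sketch, not part of the package: the helper provider_kwargs, the PROVIDER_DEFAULTS registry, and the GROQ_API_KEY variable name are assumptions made for illustration. The endpoints and model names are the ones used earlier in this README.

```python
import os

# Hypothetical registry: endpoints come from this README,
# the environment variable names are assumptions for illustration.
PROVIDER_DEFAULTS = {
    "openai": {"model": "gpt-4o-mini", "base_url": None, "key_env": "OPENAI_API_KEY"},
    "groq": {
        "model": "llama3-8b-8192",
        "base_url": "https://api.groq.com/openai/v1",
        "key_env": "GROQ_API_KEY",
    },
    "ollama": {"model": "llama3", "base_url": "http://localhost:11434/v1", "key_env": None},
}


def provider_kwargs(name: str, env=None) -> dict:
    """Build constructor keyword arguments for one provider."""
    env = os.environ if env is None else env
    if name not in PROVIDER_DEFAULTS:
        raise ValueError(f"Unknown provider {name!r}. Known: {sorted(PROVIDER_DEFAULTS)}")
    cfg = PROVIDER_DEFAULTS[name]
    if cfg["key_env"] is None:
        api_key = "ollama"  # local servers ignore the key, so any string works
    else:
        api_key = env.get(cfg["key_env"])
        if not api_key:
            raise ValueError(f"Set {cfg['key_env']} before using {name!r}")
    return {"model": cfg["model"], "api_key": api_key, "base_url": cfg["base_url"]}


kwargs = provider_kwargs("groq", env={"GROQ_API_KEY": "gsk_example"})
print(kwargs["base_url"])
```

The returned dict has the same shape as the entries in PROVIDERS above, so it can be unpacked the same way: OpenAIResponse(**provider_kwargs("groq")).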

Quick Start

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")
reply = llm.invoke("What is the capital of France?")
print(reply)
# Paris

Basic Text Generation

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    api_key="sk-...",        # or set OPENAI_API_KEY env var
    temperature=0.7,
    max_tokens=256,
)

reply = llm.invoke("Explain machine learning in one sentence.")
print(reply)
# Machine learning is a branch of AI where systems learn from data
# to make predictions or decisions without being explicitly programmed.

Async Generation

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

async def main():
    reply = await llm.ainvoke("What is the speed of light?")
    print(reply)
    # The speed of light in a vacuum is approximately 299,792,458 metres per second.

asyncio.run(main())

Streaming

Stream the response token by token synchronously.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

for chunk in llm.stream("Write a haiku about mountains."):
    print(chunk, end="", flush=True)

# Silent peaks above,
# Clouds drift through the ancient stone,
# Eagles trace the wind.

You can also enable streaming at construction time so invoke() internally streams and returns the full joined text:

llm = OpenAIResponse(model="gpt-4o", streaming=True)
reply = llm.invoke("Tell me a fun fact about space.")
print(reply)
# A day on Venus is longer than a year on Venus — it takes 243 Earth days
# to rotate once but only 225 Earth days to orbit the Sun.
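
Joining streamed chunks into one string is simple to reproduce by hand, for example when you want to display chunks as they arrive and still keep the full text. This sketch uses a stand-in generator (fake_stream) in place of llm.stream(...); collect and its on_chunk callback are illustrative names, not package API.

```python
from typing import Callable, Iterable, Iterator, Optional


def fake_stream() -> Iterator[str]:
    """Stand-in for llm.stream(...): yields text chunks."""
    yield from ["A day on Venus ", "is longer than ", "its year."]


def collect(chunks: Iterable[str], on_chunk: Optional[Callable[[str], None]] = None) -> str:
    """Consume a chunk iterator, optionally report each chunk, and return the joined text."""
    parts = []
    for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


text = collect(fake_stream())
print(text)
```

Passing on_chunk=lambda c: print(c, end="", flush=True) would echo chunks live while still returning the complete reply.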

Async Streaming

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

async def main():
    async for chunk in llm.astream("List the prime numbers up to 20."):
        print(chunk, end="", flush=True)
    # 2, 3, 5, 7, 11, 13, 17, 19

asyncio.run(main())

Batch Invocation

Synchronous (sequential)

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o-mini")

prompts = [
    "Capital of Japan?",
    "Capital of Germany?",
    "Capital of Brazil?",
]

results = llm.batch_invoke(prompts)
for prompt, result in zip(prompts, results):
    print(f"{prompt} -> {result}")

# Capital of Japan?   -> Tokyo
# Capital of Germany? -> Berlin
# Capital of Brazil?  -> Brasilia

Async (concurrent)

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o-mini")

async def main():
    results = await llm.abatch_invoke([
        "Capital of Japan?",
        "Capital of Germany?",
        "Capital of Brazil?",
    ])
    print(results)
    # ['Tokyo', 'Berlin', 'Brasilia']

asyncio.run(main())
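
When a batch is large, it can help to cap how many requests are in flight at once so a rate-limited provider is not flooded. The sketch below shows one way to do that with asyncio.Semaphore. It is not how abatch_invoke is implemented (that is not documented here); fake_ainvoke stands in for llm.ainvoke, and bounded_batch is a hypothetical helper.

```python
import asyncio


async def fake_ainvoke(prompt: str) -> str:
    """Stand-in for llm.ainvoke: sleeps briefly instead of calling an API."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def bounded_batch(prompts, call, limit: int = 2):
    """Run call(prompt) for every prompt with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(prompt):
        async with semaphore:
            return await call(prompt)

    # gather returns results in argument order, not completion order
    return await asyncio.gather(*(run_one(p) for p in prompts))


results = asyncio.run(bounded_batch(["a", "b", "c"], fake_ainvoke))
print(results)
```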

System Instruction

Set a persistent system prompt sent as the instructions field of every request.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    system_instruction="You are a concise assistant. Always reply in exactly one sentence.",
)

reply = llm.invoke("What is photosynthesis?")
print(reply)
# Photosynthesis is the process by which plants use sunlight, water, and CO2
# to produce glucose and oxygen.

Prompt Templates

Define a reusable template with {placeholders} and fill them at call time.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    prompt_template="Summarise the following {topic} in {num_words} words:\n\n{content}",
)

reply = llm.invoke(prompt_variables={
    "topic": "article",
    "num_words": "30",
    "content": "Quantum computing uses quantum bits (qubits) that can exist in superposition...",
})
print(reply)
# Quantum computing uses qubits in superposition to perform many calculations
# simultaneously, offering vastly superior speeds for specific complex problems
# like cryptography and molecular simulation.

Missing variables raise a clear error:

llm.invoke(prompt_variables={"topic": "article"})
# ValueError: Missing prompt template variables: content, num_words
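
The same check can be run ahead of time, for example in a unit test for your templates. This is a sketch of one way to detect missing placeholders using the standard library's string.Formatter; missing_variables and render are illustrative helpers, and the package's own implementation may differ.

```python
from string import Formatter


def missing_variables(template: str, variables: dict) -> list[str]:
    """Return placeholder names used in `template` but absent from `variables`, sorted."""
    names = {field for _, field, _, _ in Formatter().parse(template) if field}
    return sorted(names - set(variables))


def render(template: str, variables: dict) -> str:
    """Fill the template, or raise ValueError naming every missing variable."""
    missing = missing_variables(template, variables)
    if missing:
        raise ValueError("Missing prompt template variables: " + ", ".join(missing))
    return template.format(**variables)


template = "Summarise the following {topic} in {num_words} words:\n\n{content}"
print(missing_variables(template, {"topic": "article"}))
```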

Reasoning Models

o3, o3-mini, and o1 are OpenAI's reasoning models. They support reasoning_effort to control how long the model thinks before answering. Higher effort produces better answers for hard problems but takes longer and costs more.

The reasoning_effort parameter targets OpenAI's reasoning models. When using other providers, omit reasoning_effort unless the provider documents support for it.

reasoning_effort

Valid values: "low", "medium", "high".

from autourgos_responses import OpenAIResponse

# Low effort — fast, cheaper
llm = OpenAIResponse(model="o3-mini", reasoning_effort="low")
reply = llm.invoke("What is 17 × 23?")
print(reply)
# 391

# Medium effort — balanced
llm = OpenAIResponse(model="o3-mini", reasoning_effort="medium")
reply = llm.invoke("Solve: if a train travels at 80 km/h for 2.5 hours, how far does it go?")
print(reply)
# The train travels 200 km. (80 km/h × 2.5 h = 200 km)

# High effort — most thorough, best for hard problems
llm = OpenAIResponse(model="o3", reasoning_effort="high")
reply = llm.invoke(
    "Prove that the square root of 2 is irrational."
)
print(reply)
# Assume for contradiction that √2 = p/q where p and q are integers with no common factors...

When to use each level

effort	Use for	Speed	Cost
`"low"`	Simple maths, factual Q&A, quick summaries	Very fast	Lowest
`"medium"`	Multi-step reasoning, code generation	Moderate	Medium
`"high"`	Hard proofs, complex analysis, frontier research	Slow	Highest

Invalid effort raises immediately

OpenAIResponse(model="o3-mini", reasoning_effort="ultra")
# ValueError: Invalid reasoning_effort 'ultra'. Must be one of: ['high', 'low', 'medium']
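
If the effort level comes from user input or a config file, you may want to validate it before building the client. A minimal sketch, assuming the three values listed above are the full set; check_reasoning_effort is a hypothetical helper and not part of the package.

```python
VALID_EFFORTS = frozenset({"low", "medium", "high"})


def check_reasoning_effort(value):
    """Return `value` unchanged if it is None or a valid effort, otherwise raise ValueError."""
    if value is not None and value not in VALID_EFFORTS:
        raise ValueError(
            f"Invalid reasoning_effort {value!r}. Must be one of: {sorted(VALID_EFFORTS)}"
        )
    return value


try:
    check_reasoning_effort("ultra")
except ValueError as exc:
    message = str(exc)
    print(message)
```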

Multi-Modal Vision Input

Pass image files, URLs, or raw bytes alongside text.

Note: vision support depends on the provider and model. GPT-4o, LLaVA (Ollama), and several others support it.

From a file path

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")
reply = llm.invoke("What objects are in this image?", files=["photo.jpg"])
print(reply)
# The image shows a wooden desk with a laptop, a coffee mug, and an open notebook.

From a URL

reply = llm.invoke(
    "Describe this chart in detail.",
    files=["https://example.com/sales-chart.png"],
)
print(reply)
# The chart is a bar graph comparing quarterly revenue across four product lines.
# Q3 shows the highest sales at approximately $2.4M for Product A...

From raw bytes

with open("diagram.png", "rb") as f:
    image_bytes = f.read()

reply = llm.invoke("Explain this architecture diagram.", files=[image_bytes])
print(reply)
# The diagram shows a microservices architecture with an API gateway at the top
# routing requests to three downstream services: Auth, Orders, and Payments...
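
For context, inline images reach the Responses API as base64 data URLs inside an input_image content part. The sketch below shows that encoding step on its own. Whether the wrapper does exactly this internally is an assumption; sniff_mime, to_data_url, and image_part are illustrative names.

```python
import base64


def sniff_mime(data: bytes) -> str:
    """Guess an image MIME type from its leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    return "application/octet-stream"


def to_data_url(data: bytes) -> str:
    """Encode raw bytes as a data URL with a sniffed MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{sniff_mime(data)};base64,{encoded}"


def image_part(data: bytes) -> dict:
    """Content part in the shape the Responses API uses for inline images."""
    return {"type": "input_image", "image_url": to_data_url(data)}


png_header = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
print(image_part(png_header)["image_url"])
```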

Multiple images

reply = llm.invoke(
    "Which image shows more people?",
    files=["crowd1.jpg", "crowd2.jpg"],
)
print(reply)
# The first image shows more people — it appears to be a large outdoor concert
# with thousands of attendees, while the second shows a small group of around 20.

Structured Output

Return data that matches a Pydantic model automatically.

import json

from pydantic import BaseModel, Field
from autourgos_responses import OpenAIResponse

class WeatherReport(BaseModel):
    city: str = Field(description="Name of the city")
    temperature_celsius: float = Field(description="Current temperature in Celsius")
    condition: str = Field(description="Weather condition e.g. Sunny, Rainy")
    humidity_percent: int = Field(description="Humidity percentage 0-100")

# structured_output=True makes invoke() return the metadata dict,
# whose "response" key holds the generated JSON text
llm = OpenAIResponse(model="gpt-4o", response_schema=WeatherReport, structured_output=True)
result = llm.invoke("Describe a typical summer day in London.")

data = json.loads(result["response"])
print(data)
# {
#   "city": "London",
#   "temperature_celsius": 22.0,
#   "condition": "Partly Cloudy",
#   "humidity_percent": 65
# }

Use a plain dict schema instead of Pydantic:

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age":  {"type": "integer"},
    },
    "required": ["name", "age"],
}

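# structured_output=True so that invoke() returns a dict with a "response" key
llm = OpenAIResponse(model="gpt-4o", response_schema=schema, structured_output=True)
result = llm.invoke("Invent a fictional person.")
print(result["response"])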
# {"name": "Mira Caldwell", "age": 34}
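
With a plain dict schema there is no Pydantic model to validate the reply, so a small check before using the data can catch surprises. This sketch handles flat object schemas only and uses just the standard library; check_against_schema is a hypothetical helper, and a full JSON Schema validator would be the better choice for nested schemas.

```python
import json

# Maps JSON Schema type names to Python types (flat schemas only)
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_against_schema(payload: str, schema: dict) -> dict:
    """Parse `payload` and check required keys and basic types against `schema`."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        expected = _TYPES.get(spec.get("type"))
        value = data[key]
        # bool is a subclass of int in Python, so reject it for non-boolean fields
        stray_bool = isinstance(value, bool) and spec.get("type") != "boolean"
        if expected and (stray_bool or not isinstance(value, expected)):
            raise ValueError(f"Key {key!r} is not of type {spec['type']!r}")
    return data


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
person = check_against_schema('{"name": "Mira Caldwell", "age": 34}', schema)
print(person["age"])
```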

JSON Mode

Force valid JSON output without defining a schema.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    response_mime_type="application/json",
    system_instruction="Always respond with valid JSON only.",
)

reply = llm.invoke("List three programming languages with their year of creation.")
print(reply)
# {
#   "languages": [
#     {"name": "Python",     "year": 1991},
#     {"name": "JavaScript", "year": 1995},
#     {"name": "Rust",       "year": 2010}
#   ]
# }

Multi-Turn Chat

Pass a list of role-tagged messages directly to carry conversation history.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

messages = [
    {"role": "user",      "content": "My favourite colour is blue."},
    {"role": "assistant", "content": "That is a great choice! Blue is calming and versatile."},
    {"role": "user",      "content": "What is my favourite colour?"},
]

reply = llm.chat(messages)
print(reply)
# Your favourite colour is blue!

Async multi-turn

import asyncio
from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

async def main():
    messages = [
        {"role": "user",      "content": "I work as a data scientist."},
        {"role": "assistant", "content": "That is a fascinating field!"},
        {"role": "user",      "content": "What is my job?"},
    ]
    reply = await llm.achat(messages)
    print(reply)
    # You work as a data scientist.

asyncio.run(main())

Building a conversation loop

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")
history = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = llm.chat(history)
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Jitin."))
# Nice to meet you, Jitin!

print(chat("I am building an AI framework called Autourgos."))
# That sounds exciting! What does Autourgos focus on?

print(chat("What is my name and what am I building?"))
# Your name is Jitin, and you are building an AI framework called Autourgos.
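
In a long-running loop the history list grows without bound, and every call resends all of it. One common remedy is to keep only the most recent messages. A minimal sketch, assuming a simple message-count limit is acceptable; trim_history is a hypothetical helper, and a token-based limit would be more precise.

```python
def trim_history(history: list[dict], max_messages: int = 6) -> list[dict]:
    """Keep the most recent messages, making sure the window opens on a user turn."""
    recent = history[-max_messages:] if max_messages > 0 else []
    # Drop leading assistant messages so the model never sees a reply without its question
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent


history = [
    {"role": "user", "content": "My name is Jitin."},
    {"role": "assistant", "content": "Nice to meet you, Jitin!"},
    {"role": "user", "content": "I am building an AI framework."},
    {"role": "assistant", "content": "That sounds exciting!"},
    {"role": "user", "content": "What is my name?"},
]
window = trim_history(history, max_messages=2)
print(len(window))
```

Note the trade-off this example makes visible: with a window this small the message containing the name is dropped, so the model can no longer answer the final question. Calling llm.chat(trim_history(history)) instead of llm.chat(history) applies the limit.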

Cost Tracking

Pass pricing (USD per 1 million tokens) to get cost breakdowns.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(
    model="gpt-4o",
    input_pricing=2.50,    # $2.50 per 1M input tokens
    output_pricing=10.00,  # $10.00 per 1M output tokens
    structured_output=True,
)

result = llm.invoke("Summarise the history of the internet in 3 sentences.")
print(result["model"])          # gpt-4o
print(result["response"])       # The internet began as ARPANET in the 1960s...
print(result["input_tokens"])   # 21
print(result["output_tokens"])  # 68
print(result["total_tokens"])   # 89
print(result["input_cost"])     # 0.0000525
print(result["output_cost"])    # 0.00068
print(result["total_cost"])     # 0.0007325
print(result["latency_ms"])     # 1102.4

Access the last call metadata without structured_output=True:

llm = OpenAIResponse(model="gpt-4o", input_pricing=2.50, output_pricing=10.00)
reply = llm.invoke("Hello!")
print(llm.last_metadata)
# {
#   "model": "gpt-4o",
#   "response": "Hello! How can I help you today?",
#   "input_tokens": 9,
#   "output_tokens": 10,
#   "total_tokens": 19,
#   "input_cost": 0.0000225,
#   "output_cost": 0.0001,
#   "total_cost": 0.0001225,
#   "latency_ms": 921.7
# }
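
The cost figures follow directly from the token counts and the per-million prices. The arithmetic, reproduced here for the first example above (21 input and 68 output tokens); token_cost is an illustrative helper, not package API.

```python
def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for `tokens` tokens at `price_per_million` USD per 1M tokens."""
    return tokens * price_per_million / 1_000_000


input_cost = token_cost(21, 2.50)    # 21 input tokens at $2.50 per 1M
output_cost = token_cost(68, 10.00)  # 68 output tokens at $10.00 per 1M
total_cost = input_cost + output_cost
print(f"{input_cost:.7f} {output_cost:.5f} {total_cost:.7f}")
```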

Context Manager

Automatically closes the HTTP client when done.

from autourgos_responses import OpenAIResponse

with OpenAIResponse(model="gpt-4o") as llm:
    reply = llm.invoke("Quick question: what is 2 + 2?")
    print(reply)
    # 4
# Client is closed automatically here

Async context manager:

import asyncio
from autourgos_responses import OpenAIResponse

async def main():
    async with OpenAIResponse(model="gpt-4o") as llm:
        reply = await llm.ainvoke("What year did the Berlin Wall fall?")
        print(reply)
        # The Berlin Wall fell in 1989.

asyncio.run(main())

Circuit Breaker

Protects against cascading failures. After circuit_failure_threshold consecutive API errors, the circuit opens and all calls are blocked for circuit_cooldown_time seconds; once that time has passed, a single probe call is allowed through.

This is useful when you are using a local model (Ollama, LM Studio) or a rate-limited API — if the server goes down, the circuit breaker stops your code from hammering it with failed requests.

from autourgos_responses import OpenAIResponse, CircuitBreakerOpenException

llm = OpenAIResponse(
    model="gpt-4o",
    circuit_failure_threshold=3,   # open after 3 consecutive failures
    circuit_cooldown_time=60.0,    # block calls for 60 seconds
)

try:
    reply = llm.invoke("Hello!")
    print(reply)
except CircuitBreakerOpenException as e:
    print(f"Circuit is open, skipping call: {e}")
    # Circuit breaker OPEN for OpenAIResponse — 3 consecutive failures.
    # Blocked until 1718500060.0.

After the cooldown expires, the next call is allowed through as a probe. If it succeeds, the circuit resets to closed. If it fails again, the cooldown restarts.
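
The closed, open, and probe behaviour described above can be sketched as a small state machine. This is an illustration of the pattern under stated assumptions, not the package's implementation: the CircuitBreaker class, CircuitOpenError, and the injectable clock are all hypothetical names chosen to make the example testable without waiting.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown:
            raise CircuitOpenError("circuit open, call skipped")
        # Either the circuit is closed, or the cooldown elapsed and this call is the probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or restart the cooldown after a failed probe
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def flaky():
    raise ConnectionError("server down")


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, cooldown=10.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

try:
    breaker.call(flaky)
    blocked = False
except CircuitOpenError:
    blocked = True  # third call never reaches the server

now[0] = 11.0  # cooldown has passed, next call is the probe
probe_result = breaker.call(lambda: "ok")
print(blocked, probe_result)
```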

Low-Level Access

Direct access to the raw Responses API object when you need full control.

from autourgos_responses import OpenAIResponse

llm = OpenAIResponse(model="gpt-4o")

raw = llm.create("Explain gravity briefly.")
print(raw.output_text)
print(raw.usage.input_tokens)
print(raw.usage.output_tokens)

Async:

# Run inside an async function; top-level await is not valid in a plain script
raw = await llm.acreate("Explain gravity briefly.")
print(raw.output_text)

With overrides:

raw = llm.create(
    "Summarise this.",
    temperature=0.3,
    max_output_tokens=50,
)

Error Handling

from autourgos_responses import (
    OpenAIResponse,
    OpenAIResponseAPIError,
    OpenAIResponseResponseError,
    OpenAIResponseConfigError,
    OpenAIResponseImportError,
    CircuitBreakerOpenException,
)

llm = OpenAIResponse(model="gpt-4o")

try:
    reply = llm.invoke("Hello!")
    print(reply)
except OpenAIResponseAPIError as e:
    # All retries exhausted — network issue or rate limit
    print(f"API error after retries: {e}")
except OpenAIResponseResponseError as e:
    # Response was received but no text could be extracted
    print(f"Could not parse response: {e}")
except OpenAIResponseConfigError as e:
    # Incompatible options e.g. streaming=True + structured_output=True
    print(f"Configuration error: {e}")
except OpenAIResponseImportError as e:
    # openai package is not installed
    print(f"openai not installed: {e}")
except CircuitBreakerOpenException as e:
    # Too many recent failures — circuit is open
    print(f"Circuit breaker is open: {e}")

Retry behaviour

By default the wrapper retries up to 3 times with exponential back-off:

Attempt	Wait before retry
1st failure	0.5 s
2nd failure	1.0 s
3rd failure	2.0 s
4th failure	raises `OpenAIResponseAPIError`

Change with max_retries and backoff_factor:

llm = OpenAIResponse(
    model="gpt-4o",
    max_retries=5,
    backoff_factor=1.0,   # waits: 1s, 2s, 4s, 8s, 16s, then raises
)
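
The default schedule in the table follows from the formula given in the Constructor Reference (wait = factor × 2^attempt). The retry loop below is a sketch of that behaviour, assuming one wait per retry and a final raise once retries are exhausted; call_with_retries and its sleep parameter are illustrative, not package API.

```python
import time


def call_with_retries(func, max_retries=3, backoff_factor=0.5, sleep=time.sleep):
    """Call func(); on failure wait backoff_factor * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted
            sleep(backoff_factor * 2 ** attempt)


def always_fails():
    raise ConnectionError("temporary outage")


waits = []  # record the waits instead of actually sleeping
try:
    call_with_retries(always_fails, sleep=waits.append)
except ConnectionError:
    pass
print(waits)
```

Injecting sleep keeps the example instant to run; in real use the default time.sleep applies.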

Constructor Reference

Parameter	Type	Default	Description
`model`	`str`	required	Model name. e.g. `"gpt-4o"`, `"o3-mini"`, `"llama3-70b-8192"`
`api_key`	`str`	`OPENAI_API_KEY` env	API key for the provider you are using
`base_url`	`str`	`OPENAI_BASE_URL` env	Provider endpoint. e.g. `"https://api.groq.com/openai/v1"` or `"http://localhost:11434/v1"`
`organization`	`str`	`None`	OpenAI organization ID (OpenAI only)
`project`	`str`	`None`	OpenAI project ID (OpenAI only)
`system_instruction`	`str`	`None`	System prompt sent as `instructions` field
`prompt_template`	`str`	`None`	Template with `{variable}` placeholders
`temperature`	`float`	`None`	Sampling temperature 0–2
`top_p`	`float`	`None`	Nucleus sampling 0–1
`max_tokens`	`int`	`None`	Maximum output tokens (maps to `max_output_tokens`)
`reasoning_effort`	`str`	`None`	`"low"`, `"medium"`, or `"high"` — for o3, o3-mini, o1 only
`reasoning_summary`	`str`	`None`	Include reasoning summary in output (OpenAI only)
`text_verbosity`	`str`	`None`	`"concise"`, `"detailed"`, or `"auto"`
`response_schema`	`BaseModel` / `dict`	`None`	Pydantic model or JSON schema for structured output
`response_mime_type`	`str`	`None`	`"application/json"` enables JSON object mode
`structured_output`	`bool`	`False`	If `True`, `invoke()` returns a metadata dict
`streaming`	`bool`	`False`	If `True`, `invoke()` streams internally and joins
`max_retries`	`int`	`3`	Retry attempts on transient API errors
`timeout`	`float`	`60.0`	Request timeout in seconds
`backoff_factor`	`float`	`0.5`	Exponential back-off base (wait = factor × 2^attempt)
`input_pricing`	`float`	`None`	USD per 1 million input tokens
`output_pricing`	`float`	`None`	USD per 1 million output tokens
`circuit_failure_threshold`	`int`	`5`	Consecutive failures before the circuit opens
`circuit_cooldown_time`	`float`	`30.0`	Seconds the circuit stays open before probing

What Each Method Returns

Method	Returns
`invoke(prompt)`	`str` — generated text (or `dict` if `structured_output=True`)
`ainvoke(prompt)`	same as `invoke`, async
`stream(prompt)`	`Iterator[str]` — text chunks
`astream(prompt)`	`AsyncIterator[str]` — text chunks
`batch_invoke(prompts)`	`list[str]` — one result per prompt, sequential
`abatch_invoke(prompts)`	`list[str]` — concurrent results
`chat(messages)`	`str` — generated text (or `dict` if `structured_output=True`)
`achat(messages)`	same as `chat`, async
`create(input_data)`	Raw OpenAI Responses API response object
`acreate(input_data)`	same as `create`, async

Metadata dict keys (when `structured_output=True` or via `llm.last_metadata`)

Key	Type	Description
`"model"`	`str`	Model name used
`"response"`	`str`	Generated text
`"input_tokens"`	`int | None`	Input token count
`"output_tokens"`	`int | None`	Output token count
`"total_tokens"`	`int | None`	Total token count
`"input_cost"`	`float`	Input cost in USD (only if `input_pricing` set)
`"output_cost"`	`float`	Output cost in USD (only if `output_pricing` set)
`"total_cost"`	`float`	Total cost in USD (only if both pricing values set)
`"latency_ms"`	`float`	Request round-trip time in milliseconds

Supported Providers (quick reference)

Provider	base_url	Notes
OpenAI	(default)	GPT-4o, o3, o3-mini, o1, GPT-4o-mini
Groq	`https://api.groq.com/openai/v1`	Llama 3, Mixtral, Gemma — very fast
Together AI	`https://api.together.xyz/v1`	100+ open-source models
Mistral AI	`https://api.mistral.ai/v1`	mistral-large, mixtral, codestral
DeepSeek	`https://api.deepseek.com/v1`	deepseek-chat, deepseek-reasoner
Perplexity	`https://api.perplexity.ai`	Web-connected sonar models
Ollama	`http://localhost:11434/v1`	Runs locally, no API key needed
LM Studio	`http://localhost:1234/v1`	Runs locally, GUI-based
vLLM	`http://your-server:8000/v1`	Self-hosted, high throughput
Azure OpenAI	`https://<resource>.openai.azure.com/...`	Enterprise OpenAI

Differences vs autourgos-openaichat

Feature	autourgos-openaichat	autourgos-responses
API endpoint	`chat.completions.create`	`responses.create`
System prompt field	`messages[0].role = "system"`	`instructions` parameter
Reasoning models	Not supported	`reasoning_effort` param for o3/o1
Text verbosity control	Not supported	`text_verbosity` param
Multi-turn input	Messages list	Messages list or plain string
Native tool calling	Supported	Not yet supported by this package
Use when	Building chat agents, tool-calling	Using reasoning models, simple generation

Both packages support the same providers via base_url. Choose based on the API endpoint your use case needs.
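
The endpoint and system-prompt rows of the table can be made concrete by comparing request bodies. This sketch builds both as plain dicts; the field names (messages, instructions, input) are those of the two OpenAI endpoints, while the helper functions are hypothetical and not part of either package.

```python
def chat_completions_payload(model: str, system: str, user: str) -> dict:
    """Request body shape for chat.completions.create: system prompt is a message."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    return {"model": model, "messages": messages}


def responses_payload(model: str, system: str, user: str) -> dict:
    """Request body shape for responses.create: system prompt is a separate field."""
    payload = {"model": model, "input": user}
    if system:
        payload["instructions"] = system
    return payload


chat_body = chat_completions_payload("gpt-4o", "Be concise.", "What is photosynthesis?")
resp_body = responses_payload("gpt-4o", "Be concise.", "What is photosynthesis?")
print(sorted(chat_body), sorted(resp_body))
```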
