SmartLLM
A unified async Python wrapper for multiple LLM providers with a consistent interface.
Features
- Unified Interface — Single API for OpenAI and AWS Bedrock
- Async/Await — Built on asyncio for concurrent requests
- Smart Caching — Two-level cache (local JSON + optional DynamoDB)
- Auto Retry — Exponential backoff for transient failures
- Structured Output — Native Pydantic model support
- Streaming — Real-time streaming responses
- Rate Limiting — Built-in concurrency control per model
- Reasoning Models — Full support, including `reasoning_effort` and `reasoning_tokens`
- Progress Callbacks — Optional `on_progress` callback for real-time events
Installation
```bash
pip install smartllm[openai]   # OpenAI only
pip install smartllm[bedrock]  # AWS Bedrock only
pip install smartllm[all]      # All providers
```
Quick Start
```python
import asyncio
from smartllm import LLMClient, TextRequest

async def main():
    async with LLMClient(provider="openai") as client:
        response = await client.generate_text(
            TextRequest(prompt="What is the capital of France?")
        )
        print(response.text)

asyncio.run(main())
```
Configuration
Environment Variables
OpenAI:

```bash
export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL="gpt-4o-mini"  # optional
```

AWS Bedrock:

```bash
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"
export BEDROCK_MODEL="anthropic.claude-3-sonnet-20240229-v1:0"  # optional
```
Explicit credentials are optional. If omitted, boto3's default credential chain is used — including EC2 instance profiles, ECS task roles, Lambda execution roles, and ~/.aws/credentials.
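For example, on an EC2 instance or Lambda function with an attached role, the client can be constructed with no explicit keys at all. A minimal sketch, assuming the `bedrock` provider name mirrors the `openai` example above:

```python
import asyncio
from smartllm import LLMClient, TextRequest

async def main():
    # No AWS keys passed; boto3 resolves credentials from its default chain
    async with LLMClient(provider="bedrock") as client:
        response = await client.generate_text(
            TextRequest(prompt="Summarize the AWS shared responsibility model in one sentence.")
        )
        print(response.text)

asyncio.run(main())
```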
Programmatic Configuration
```python
from smartllm import LLMClient, LLMConfig

config = LLMConfig(
    provider="openai",
    api_key="your-api-key",
    default_model="gpt-4o",
    temperature=0.7,
    max_tokens=2048,
    max_retries=3,
)

async with LLMClient(config) as client:
    ...
```
Usage Examples
Multi-turn Conversations
```python
from smartllm import LLMClient, MessageRequest, Message

async with LLMClient(provider="openai") as client:
    messages = [
        Message(role="user", content="My name is Alice."),
        Message(role="assistant", content="Nice to meet you, Alice!"),
        Message(role="user", content="What's my name?"),
    ]
    response = await client.send_message(MessageRequest(messages=messages))
    print(response.text)  # "Your name is Alice."
```
Structured Output
```python
from pydantic import BaseModel
from smartllm import LLMClient, TextRequest

class Person(BaseModel):
    name: str
    age: int

async with LLMClient(provider="openai") as client:
    response = await client.generate_text(
        TextRequest(prompt="Return a person named John, age 30.", response_format=Person)
    )
    print(response.structured_data.name)  # "John"
```
Streaming
```python
async with LLMClient(provider="openai") as client:
    async for chunk in client.generate_text_stream(
        TextRequest(prompt="Write a short poem.", stream=True)
    ):
        print(chunk.text, end="", flush=True)
```
Reasoning Models
```python
response = await client.generate_text(
    TextRequest(
        prompt="Solve: what is the 100th Fibonacci number?",
        reasoning_effort="high",  # "low", "medium", or "high"
    )
)
print(response.text)
print(f"Reasoning tokens: {response.reasoning_tokens}")
```
Note: reasoning models do not support temperature. Passing a value other than 1 raises ValueError.
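A short sketch of that failure mode; the exact error message may differ between versions:

```python
from smartllm import TextRequest

try:
    await client.generate_text(
        TextRequest(
            prompt="Solve: what is the 100th Fibonacci number?",
            reasoning_effort="high",
            temperature=0.5,  # not allowed for reasoning models
        )
    )
except ValueError as e:
    print(e)  # reasoning models only accept temperature=1 (or no temperature at all)
```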
OpenAI API Types
```python
# Responses API (default, recommended)
TextRequest(prompt="Hello", api_type="responses")

# Chat Completions API (legacy)
TextRequest(prompt="Hello", api_type="chat_completions")
```
Concurrent Requests
```python
tasks = [client.generate_text(TextRequest(prompt=p)) for p in prompts]
responses = await asyncio.gather(*tasks)
```
Progress Callbacks
```python
async def on_progress(event):
    print(event)

response = await client.generate_text(
    TextRequest(prompt="Hello", on_progress=on_progress)
)
```
Events: `llm_started`, `llm_done`, `cache_hit` (with `cache_source`, `cache_key`), and `error` (with `message`). Each event dict includes `event`, `ts`, `prompt`, `model`, `provider`. `llm_done` and `cache_hit` also include `input_tokens`, `output_tokens`, `reasoning_tokens`, `cached_tokens`.
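As a sketch, a handler might branch on the event name and log token usage only when a response completes; the field names below are taken from the list above:

```python
async def on_progress(event):
    name = event["event"]
    if name in ("llm_done", "cache_hit"):
        # Token accounting is only attached to completed or cached responses
        print(
            f"{name}: model={event['model']} "
            f"in={event['input_tokens']} out={event['output_tokens']} "
            f"reasoning={event['reasoning_tokens']} cached={event['cached_tokens']}"
        )
    elif name == "error":
        print(f"error: {event['message']}")
```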
DynamoDB Caching
```python
async with LLMClient(provider="openai", dynamo_table_name="my-llm-cache") as client:
    ...
```
Requires AWS credentials with DynamoDB access. Table is auto-created if it doesn't exist.
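With an L2 table configured, a repeated deterministic request should come back from cache. A sketch, assuming the table name is yours to choose:

```python
from smartllm import LLMClient, TextRequest

async with LLMClient(provider="openai", dynamo_table_name="my-llm-cache") as client:
    first = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0))
    second = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0))
    print(first.cache_source, second.cache_source)  # e.g. "miss" then "l1" or "l2"
```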
Provider-Specific Clients
```python
from smartllm.openai import OpenAILLMClient, OpenAIConfig
from smartllm.bedrock import BedrockLLMClient, BedrockConfig

async with OpenAILLMClient(OpenAIConfig(api_key="...")) as client:
    models = await client.list_available_models()

async with BedrockLLMClient(BedrockConfig(aws_region="us-east-1")) as client:
    models = await client.list_available_model_ids()
```
API Reference
TextRequest Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Input text prompt | Required |
| `model` | `str` | Model ID | Config default |
| `temperature` | `float` | Sampling temperature (0–1) | 0 |
| `max_tokens` | `int` | Maximum output tokens | 2048 |
| `top_p` | `float` | Nucleus sampling | 1.0 |
| `system_prompt` | `str` | System context | None |
| `stream` | `bool` | Enable streaming | False |
| `response_format` | `BaseModel` | Pydantic model for structured output | None |
| `use_cache` | `bool` | Enable caching | True |
| `clear_cache` | `bool` | Clear cache before request | False |
| `api_type` | `str` | `"responses"` or `"chat_completions"` | `"responses"` |
| `reasoning_effort` | `str` | `"low"`, `"medium"`, or `"high"` | None |
| `on_progress` | `Callable` | Progress event callback (sync or async) | None |
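For illustration, a request combining several of these parameters (the values are arbitrary and assume an open `client` as in the earlier examples):

```python
from smartllm import TextRequest

request = TextRequest(
    prompt="List three facts about the Moon.",
    model="gpt-4o",
    system_prompt="You are a concise assistant.",
    max_tokens=512,
    top_p=0.9,
    use_cache=False,        # skip both cache levels for this call
    api_type="responses",
)
response = await client.generate_text(request)
```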
TextResponse Fields
| Field | Type | Description |
|---|---|---|
| `text` | `str` | Generated text |
| `model` | `str` | Model that generated the response |
| `stop_reason` | `str` | Reason generation stopped |
| `input_tokens` | `int` | Input token count |
| `output_tokens` | `int` | Output token count |
| `reasoning_tokens` | `int` | Reasoning tokens used (OpenAI only, 0 otherwise) |
| `cached_tokens` | `int` | Prompt cache tokens (OpenAI only, 0 otherwise) |
| `timestamp` | `str \| None` | ISO 8601 UTC timestamp of the original API call |
| `elapsed_seconds` | `float \| None` | Duration of the original API call in seconds |
| `metadata` | `dict` | Request context: prompt/messages and `response_format` JSON schema |
| `structured_data` | `BaseModel \| None` | Parsed Pydantic object (when `response_format` was set) |
| `cache_source` | `str` | `"miss"`, `"l1"` (local), or `"l2"` (DynamoDB) |
| `cache_key` | `str \| None` | Cache key for this request |
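A quick sketch of reading these fields after a call:

```python
response = await client.generate_text(TextRequest(prompt="Hello", temperature=0))

print(response.text)
print(response.model, response.stop_reason)
print(response.input_tokens, response.output_tokens)
print(response.cache_source, response.cache_key)
print(response.timestamp, response.elapsed_seconds)
```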
Structured Output Error Handling
When using `response_format`, two error conditions are worth handling explicitly:

Truncated output — if the provider cuts off the response before the structured output is complete, a `ValueError` is raised:
```python
try:
    response = await client.generate_text(
        TextRequest(prompt="...", response_format=MyModel, max_tokens=100)
    )
except ValueError as e:
    print(e)  # "Bedrock truncated structured output (stop_reason=max_tokens)"
              # "OpenAI truncated structured output (finish_reason=length)"
              # "OpenAI truncated structured output (status=incomplete)"
```
Increase max_tokens to avoid this.
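One way to recover is to retry with a larger budget. A sketch of that pattern; this loop is application code, not part of the library:

```python
max_tokens = 100
for _ in range(3):
    try:
        response = await client.generate_text(
            TextRequest(prompt="...", response_format=MyModel, max_tokens=max_tokens)
        )
        break
    except ValueError:
        max_tokens *= 4  # give the structured output more room and try again
```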
Provider serialization quirks — Bedrock occasionally returns list fields as JSON strings rather than inline arrays. Pydantic's `model_validate` is used internally to handle coercion where possible. If your model has list fields and you still see `ValidationError`, add a field validator:
```python
import json
from pydantic import BaseModel, field_validator

class BookList(BaseModel):
    books: list[str]

    @field_validator("books", mode="before")
    @classmethod
    def parse_json_string(cls, v):
        # Bedrock may return the list as a JSON-encoded string; decode it first
        if isinstance(v, str):
            return json.loads(v)
        return v
```
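The model can then be passed as `response_format` like any other; a short usage sketch:

```python
response = await client.generate_text(
    TextRequest(prompt="List three classic science fiction novels.", response_format=BookList)
)
print(response.structured_data.books)
```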
Caching
Responses are cached automatically when `temperature=0` or when using a reasoning model. Streaming responses are never cached.
The cache key is derived from: `model`, `prompt` (or `messages`), `max_tokens`, `top_p`, `system_prompt`, `response_format`, `api_type`, and `reasoning_effort`.
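Changing any of those fields produces a separate cache entry. A sketch using `system_prompt`:

```python
a = await client.generate_text(
    TextRequest(prompt="What is 2+2?", temperature=0, system_prompt="Answer in words.")
)
b = await client.generate_text(
    TextRequest(prompt="What is 2+2?", temperature=0, system_prompt="Answer with a digit.")
)
print(a.cache_key == b.cache_key)  # False: system_prompt is part of the key
```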
What is stored:
| Field | Description |
|---|---|
| `text` | Raw response text |
| `model` | Model used |
| `stop_reason` | Stop reason |
| `input_tokens` | Input token count |
| `output_tokens` | Output token count |
| `reasoning_tokens` | Reasoning token count |
| `cached_tokens` | Prompt cache token count |
| `timestamp` | ISO 8601 UTC timestamp of the original API call |
| `elapsed_seconds` | Duration of the original API call in seconds |
| `metadata.prompt` | Original prompt (or messages) — stored in top-level cache metadata, not duplicated in data |
| `metadata.response_format` | JSON schema of requested output format |
| `structured_data` | Parsed Pydantic object (as dict) |
timestamp and elapsed_seconds are stored and restored on cache hits — they reflect when the original API call was made and how long it took.
```python
response1 = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0))
print(response1.cache_source)  # "miss"

response2 = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0))
print(response2.cache_source)  # "l1" or "l2"

# Force refresh
response3 = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0, clear_cache=True))
```
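Continuing the example above, a hit restores the original call's metadata rather than the hit's own timing:

```python
print(response1.timestamp == response2.timestamp)              # True when response2 came from cache
print(response1.elapsed_seconds == response2.elapsed_seconds)  # True: both reflect the original call
```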
Development
```bash
git clone https://github.com/Redundando/smartllm.git
cd smartllm
pip install -e .[all,dev]

pytest tests/unit/ -v
pytest tests/integration/ --model gpt-4o
```
License
MIT — see LICENSE.
Issues: GitHub Issues