A unified async Python wrapper for multiple LLM providers with OpenAI Responses API and reasoning support

SmartLLM

A unified async Python wrapper for multiple LLM providers with a consistent interface.

Features

Unified Interface — Single API for OpenAI and AWS Bedrock
Async/Await — Built on asyncio for concurrent requests
Smart Caching — Two-level cache (local JSON + optional DynamoDB)
Auto Retry — Exponential backoff for transient failures
Structured Output — Native Pydantic model support
Streaming — Real-time streaming responses
Streaming with Assembly — Internal streaming that returns a single TextResponse (solves Bedrock read timeouts on large requests)
Rate Limiting — Built-in concurrency control per model
Reasoning Models — Full support including reasoning_effort and reasoning_tokens
Extended Thinking (Bedrock) — Claude extended thinking with two-pass structured output
Progress Callbacks — Optional on_progress for real-time events (including retries)
Configurable Timeouts — Adjustable HTTP read/connect timeouts for Bedrock (default 300s read)

Installation

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "smartllm[openai]"   # OpenAI only
pip install "smartllm[bedrock]"  # AWS Bedrock only
pip install "smartllm[all]"      # All providers

Quick Start

import asyncio
from smartllm import LLMClient, TextRequest

async def main():
    async with LLMClient(provider="openai") as client:
        response = await client.generate_text(
            TextRequest(prompt="What is the capital of France?")
        )
        print(response.text)

asyncio.run(main())

Configuration

Environment Variables

OpenAI:

export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL="gpt-4o-mini"  # optional

AWS Bedrock:

export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"
export BEDROCK_MODEL="anthropic.claude-3-sonnet-20240229-v1:0"  # optional
export BEDROCK_READ_TIMEOUT="300"   # HTTP read timeout in seconds (default: 300)
export BEDROCK_CONNECT_TIMEOUT="10" # HTTP connect timeout in seconds (default: 10)

Explicit credentials are optional. If omitted, boto3's default credential chain is used — including EC2 instance profiles, ECS task roles, Lambda execution roles, and ~/.aws/credentials.
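As a minimal sketch of how the two Bedrock timeout variables could be read with their documented defaults: `read_bedrock_timeouts` is a hypothetical helper written for illustration, not a SmartLLM function. Only the variable names and the default values (300 and 10 seconds) come from the text above.

```python
import os


def read_bedrock_timeouts(env=None):
    """Return (read_timeout, connect_timeout) in seconds.

    Hypothetical helper: falls back to the documented defaults
    (300 s read, 10 s connect) when a variable is unset.
    """
    env = os.environ if env is None else env
    read_timeout = float(env.get("BEDROCK_READ_TIMEOUT", "300"))
    connect_timeout = float(env.get("BEDROCK_CONNECT_TIMEOUT", "10"))
    return read_timeout, connect_timeout
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test.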

Programmatic Configuration

import asyncio
from smartllm import LLMClient, LLMConfig

config = LLMConfig(
    provider="openai",
    api_key="your-api-key",
    default_model="gpt-4o",
    temperature=0.7,
    max_tokens=2048,
    max_retries=3,
)

async def main():
    # "async with" is only valid inside a coroutine
    async with LLMClient(config) as client:
        ...

asyncio.run(main())

Usage Examples

Multi-turn Conversations

import asyncio
from smartllm import LLMClient, MessageRequest, Message

async def main():
    async with LLMClient(provider="openai") as client:
        # Earlier turns are passed back in so the model sees the full history
        messages = [
            Message(role="user", content="My name is Alice."),
            Message(role="assistant", content="Nice to meet you, Alice!"),
            Message(role="user", content="What's my name?"),
        ]
        response = await client.send_message(MessageRequest(messages=messages))
        print(response.text)  # e.g. "Your name is Alice."

asyncio.run(main())

Structured Output

import asyncio
from pydantic import BaseModel
from smartllm import LLMClient, TextRequest

class Person(BaseModel):
    name: str
    age: int

async def main():
    async with LLMClient(provider="openai") as client:
        response = await client.generate_text(
            TextRequest(prompt="Return a person named John, age 30.", response_format=Person)
        )
        # structured_data holds the parsed Person instance
        print(response.structured_data.name)  # "John"

asyncio.run(main())

Streaming

async with LLMClient(provider="openai") as client:
    async for chunk in client.generate_text_stream(
        TextRequest(prompt="Write a short poem.", stream=True)
    ):
        print(chunk.text, end="", flush=True)

Streaming with Assembly (Bedrock only)

generate_text_streamed uses Bedrock's streaming API internally but returns a fully assembled TextResponse — identical to generate_text(). This solves read timeouts on large requests (50K+ input, 16K+ output tokens) where the non-streaming invoke_model connection idles and times out.

async with LLMClient(provider="bedrock") as client:
    response = await client.generate_text_streamed(
        TextRequest(
            prompt="Write a 5000-word technical analysis...",
            max_tokens=8192,
            temperature=0,
        )
    )
    # Returns a normal TextResponse — no chunk iteration needed
    print(response.text)
    print(f"Tokens: {response.input_tokens} in, {response.output_tokens} out")
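The idea of consuming a stream internally and returning one assembled result can be sketched without any provider. The code below is illustrative only: `assemble_stream` and `fake_stream` are hypothetical names, and the real method also tracks tokens, retries, and caching as described in this section.

```python
import asyncio


async def assemble_stream(chunks):
    """Collect text chunks from an async iterator into one string."""
    parts = []
    async for chunk in chunks:
        # Each chunk arrives as soon as it is produced, so the
        # connection never sits idle waiting for the full response.
        parts.append(chunk)
    return "".join(parts)


async def fake_stream():
    """Stand-in for a provider stream; yields text piece by piece."""
    for piece in ("Hello", ", ", "world"):
        await asyncio.sleep(0)
        yield piece
```

Calling `asyncio.run(assemble_stream(fake_stream()))` returns the joined text, which is the shape the caller sees: one value, no chunk iteration.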

When to use generate_text_streamed vs generate_text:

| Scenario | Method |
| --- | --- |
| Short requests (< 30K chars input, < 4K tokens output) | `generate_text` |
| Large requests that risk read timeout (long generation time) | `generate_text_streamed` |
| Need structured output (`response_format`) | `generate_text` (streamed rejects this) |
| Need progress visibility during long generation | `generate_text_streamed` |
| OpenAI provider | `generate_text` (streamed is Bedrock-only) |

Behavior:

Same TextResponse shape as generate_text (text, model, tokens, metadata, cache)
Same cache keys — a response cached by one method is served to the other
Same semaphore, retry logic, and concurrency gating
Fires progress events: llm_started, stream_progress, stream_thinking, llm_done, error, retry, cache_hit
Raises ValueError if response_format is set (suggests generate_text as alternative)
Raises NotImplementedError on OpenAI provider

Progress events during streaming:

def on_progress(event):
    if event["event"] == "stream_progress":
        print(f"{event['text_tokens_so_far']} tokens generated...")
    elif event["event"] == "stream_thinking":
        print(f"{event['thinking_tokens_so_far']} thinking tokens...")

response = await client.generate_text_streamed(
    TextRequest(prompt="...", on_progress=on_progress)
)

stream_progress and stream_thinking fire every ~500 estimated tokens or every 10 seconds (whichever comes first). Token count is estimated as len(text) // 4.

| Event | Fields |
| --- | --- |
| `stream_progress` | `text_tokens_so_far`, `text_so_far`, `elapsed_seconds` |
| `stream_thinking` | `thinking_tokens_so_far`, `thinking_text_so_far`, `elapsed_seconds` |
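The firing rule described above (roughly every 500 estimated tokens or every 10 seconds, with tokens estimated as `len(text) // 4`) can be sketched as a small throttle. `ProgressThrottle` is a hypothetical class for illustration; how SmartLLM implements the rule internally is not shown in this document.

```python
import time


class ProgressThrottle:
    """Decide when a progress event is due (illustrative only)."""

    def __init__(self, token_step=500, max_interval=10.0, clock=time.monotonic):
        self.token_step = token_step
        self.max_interval = max_interval
        self.clock = clock  # injectable so the rule can be tested
        self.last_tokens = 0
        self.last_time = clock()

    def should_emit(self, text_so_far):
        tokens = len(text_so_far) // 4  # estimate taken from the text above
        now = self.clock()
        enough_tokens = tokens - self.last_tokens >= self.token_step
        enough_time = now - self.last_time >= self.max_interval
        if enough_tokens or enough_time:
            # Reset both counters so the next event needs fresh progress.
            self.last_tokens = tokens
            self.last_time = now
            return True
        return False
```

Injecting the clock is a design choice for testability; the production defaults use a monotonic clock so wall-clock adjustments do not affect the interval.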

Reasoning Models

response = await client.generate_text(
    TextRequest(
        prompt="Solve: what is the 100th Fibonacci number?",
        reasoning_effort="high",  # "low", "medium", or "high"
    )
)
print(response.text)
print(f"Reasoning tokens: {response.reasoning_tokens}")

Note: reasoning models do not support temperature. Passing a value other than 1 raises ValueError.

Extended Thinking (Bedrock/Claude)

Claude models on Bedrock support extended thinking, giving the model a token budget to reason step-by-step before answering.

async with LLMClient(provider="bedrock") as client:
    # Using reasoning_effort (maps to budget: low=1024, medium=4096, high=16000)
    response = await client.generate_text(
        TextRequest(
            prompt="Analyze the tradeoffs of event sourcing vs CRUD.",
            model="eu.anthropic.claude-sonnet-4-6",
            reasoning_effort="high",
        )
    )
    print(response.text)
    print(f"Reasoning tokens: {response.reasoning_tokens}")
    print(f"Thinking: {response.metadata.get('thinking', '')[:200]}")

For precise control, use budget_tokens directly (overrides reasoning_effort):

response = await client.generate_text(
    TextRequest(
        prompt="Solve this step by step...",
        model="eu.anthropic.claude-sonnet-4-6",
        budget_tokens=8192,  # Explicit token budget (minimum 1024)
    )
)
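The precedence between the two parameters can be sketched as follows. The mapping values and the 1024 minimum come from the text above; `resolve_thinking_budget` is a hypothetical helper, and raising `ValueError` for a budget below the minimum or an unknown effort level is an assumption, not documented behavior.

```python
EFFORT_BUDGETS = {"low": 1024, "medium": 4096, "high": 16000}
MIN_BUDGET = 1024


def resolve_thinking_budget(reasoning_effort=None, budget_tokens=None):
    """Return the thinking budget in tokens, or None if thinking is off."""
    if budget_tokens is not None:
        # An explicit budget overrides the reasoning_effort mapping.
        if budget_tokens < MIN_BUDGET:
            # Assumption: the sketch rejects values under the minimum.
            raise ValueError(f"budget_tokens must be >= {MIN_BUDGET}")
        return budget_tokens
    if reasoning_effort is None:
        return None
    if reasoning_effort not in EFFORT_BUDGETS:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}")
    return EFFORT_BUDGETS[reasoning_effort]
```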

Extended Thinking + Structured Output

When both reasoning_effort (or budget_tokens) and response_format are set, SmartLLM uses a two-pass approach:

Pass 1 — Sends the prompt with extended thinking enabled. Claude reasons through the problem and produces a text answer.
Pass 2 — Sends the text answer to a second call with forced tool use to extract it into the Pydantic model.

from pydantic import BaseModel
from typing import List

class Analysis(BaseModel):
    topic: str
    pros: List[str]
    cons: List[str]
    recommendation: str

response = await client.generate_text(
    TextRequest(
        prompt="Should we use microservices or a monolith?",
        model="eu.anthropic.claude-sonnet-4-6",
        reasoning_effort="medium",
        response_format=Analysis,
    )
)
print(response.structured_data.recommendation)
print(response.metadata["pass1_tokens"])  # {"input": ..., "output": ...}
print(response.metadata["pass2_tokens"])  # {"input": ..., "output": ...}

The two-pass approach is needed because Claude's extended thinking is incompatible with forced tool use (tool_choice: {"type": "tool"}). The result is cached as a single entry — on cache hit, both passes are skipped.
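The control flow of the two passes and the single cache entry can be sketched with stand-in callables. Everything here is hypothetical: `two_pass`, `think`, and `extract` are illustrative names, the cache is a plain dict keyed by prompt (the real cache key covers more fields), and no provider is called.

```python
def two_pass(prompt, think, extract, cache):
    """Run thinking, then extraction, caching the combined result.

    think(prompt)  -> (text, token_counts)   stands in for pass 1
    extract(text)  -> (data, token_counts)   stands in for pass 2
    """
    if prompt in cache:
        return cache[prompt]  # cache hit: neither pass runs

    text, pass1_tokens = think(prompt)    # pass 1: reason, answer in text
    data, pass2_tokens = extract(text)    # pass 2: map the text to the schema

    result = {
        "text": text,
        "structured_data": data,
        "metadata": {"pass1_tokens": pass1_tokens, "pass2_tokens": pass2_tokens},
    }
    cache[prompt] = result  # one entry covers both passes
    return result
```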

Streaming with Extended Thinking

When streaming with thinking enabled, thinking chunks are yielded with metadata={"type": "thinking"}:

async for chunk in client.generate_text_stream(
    TextRequest(prompt="Explain quantum entanglement.", reasoning_effort="medium", stream=True)
):
    if chunk.metadata.get("type") == "thinking":
        print(f"[thinking] {chunk.text}", end="")
    else:
        print(chunk.text, end="")

OpenAI API Types

# Responses API (default, recommended)
TextRequest(prompt="Hello", api_type="responses")

# Chat Completions API (legacy)
TextRequest(prompt="Hello", api_type="chat_completions")

Concurrent Requests

tasks = [client.generate_text(TextRequest(prompt=p)) for p in prompts]
responses = await asyncio.gather(*tasks)
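For readers curious how a concurrency cap interacts with `asyncio.gather`, here is a minimal sketch using `asyncio.Semaphore`. `run_limited` and its `limit` parameter are hypothetical; SmartLLM applies its own per-model limit internally, and this is not its implementation.

```python
import asyncio


async def run_limited(prompts, call, limit=2):
    """Run call(prompt) for every prompt with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(prompt):
        # Tasks beyond the limit wait here until a slot frees up.
        async with semaphore:
            return await call(prompt)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

All tasks are scheduled at once, but only `limit` of them execute `call` at any moment.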

Progress Callbacks

async def on_progress(event):
    print(event)

response = await client.generate_text(
    TextRequest(prompt="Hello", on_progress=on_progress)
)

Events: llm_started, llm_done, cache_hit (with cache_source, cache_key), retry, error (with message). Each event dict includes event, ts, prompt, model, provider. llm_done and cache_hit also include input_tokens, output_tokens, reasoning_tokens, cached_tokens.

The retry event is emitted before each retry attempt and includes:

| Field | Description |
| --- | --- |
| `event` | `"retry"` |
| `attempt` | Current retry number (1-indexed) |
| `max_retries` | Total retries configured |
| `error` | Exception class name (e.g. `"ReadTimeoutError"`) |
| `error_message` | Full error string |
| `model` | Model being called |
| `max_tokens` | Max tokens for this request |
| `delay` | Seconds until next attempt |
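A retry loop that emits an event of this shape before each attempt might look like the sketch below. The base delay of 1 second, the doubling factor, and the `call_with_retry` helper itself are assumptions for illustration; the document only states that backoff is exponential. The event here carries a subset of the fields listed above.

```python
import asyncio


async def call_with_retry(fn, max_retries=3, base_delay=1.0,
                          on_progress=None, sleep=asyncio.sleep,
                          retryable=(Exception,)):
    """Await fn(), retrying with exponential backoff on failure."""
    attempt = 0
    while True:
        try:
            return await fn()
        except retryable as exc:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ...
            if on_progress is not None:
                # Emitted before sleeping, i.e. before the retry attempt.
                on_progress({
                    "event": "retry",
                    "attempt": attempt,
                    "max_retries": max_retries,
                    "error": type(exc).__name__,
                    "error_message": str(exc),
                    "delay": delay,
                })
            await sleep(delay)
```

In practice `retryable` would be narrowed to transient errors such as timeouts, so that programming errors fail immediately.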

DynamoDB Caching

async with LLMClient(provider="openai", dynamo_table_name="my-llm-cache") as client:
    ...

Requires AWS credentials with DynamoDB access. Table is auto-created if it doesn't exist.

Provider-Specific Clients

from smartllm.openai import OpenAILLMClient, OpenAIConfig
from smartllm.bedrock import BedrockLLMClient, BedrockConfig

async with OpenAILLMClient(OpenAIConfig(api_key="...")) as client:
    models = await client.list_available_models()

async with BedrockLLMClient(BedrockConfig(aws_region="us-east-1", read_timeout=300)) as client:
    models = await client.list_available_model_ids()

API Reference

TextRequest Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | str | Input text prompt | Required |
| `model` | str | Model ID | Config default |
| `temperature` | float | Sampling temperature (0–1) | 0 |
| `max_tokens` | int | Maximum output tokens | 2048 |
| `top_p` | float | Nucleus sampling | 1.0 |
| `system_prompt` | str | System context | None |
| `stream` | bool | Enable streaming | False |
| `response_format` | BaseModel | Pydantic model for structured output | None |
| `use_cache` | bool | Enable caching | True |
| `clear_cache` | bool | Clear cache before request | False |
| `api_type` | str | `"responses"` or `"chat_completions"` | `"responses"` |
| `reasoning_effort` | str | `"low"`, `"medium"`, or `"high"` | None |
| `budget_tokens` | int | Explicit thinking budget in tokens (Bedrock/Claude). Overrides `reasoning_effort` mapping. Minimum 1024. | None |
| `on_progress` | Callable | Progress event callback (sync or async) | None |
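Since `on_progress` may be either a plain function or a coroutine function, a caller-side dispatcher has to handle both. The sketch below shows one common way to do that with `inspect.isawaitable`; `emit` is a hypothetical helper and not necessarily how SmartLLM invokes callbacks.

```python
import asyncio
import inspect


async def emit(callback, event):
    """Invoke a progress callback that may be sync or async."""
    if callback is None:
        return
    result = callback(event)
    # An async callback returns a coroutine, which must be awaited;
    # a sync callback has already run and usually returns None.
    if inspect.isawaitable(result):
        await result
```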

TextResponse Fields

| Field | Type | Description |
| --- | --- | --- |
| `text` | str | Generated text |
| `model` | str | Model that generated the response |
| `stop_reason` | str | Reason generation stopped |
| `input_tokens` | int | Input token count |
| `output_tokens` | int | Output token count |
| `reasoning_tokens` | int | Reasoning/thinking tokens used (OpenAI reasoning models and Bedrock extended thinking) |
| `cached_tokens` | int | Prompt cache tokens (OpenAI only, `0` otherwise) |
| `timestamp` | str \| None | ISO 8601 UTC timestamp of the original API call |
| `elapsed_seconds` | float \| None | Duration of the original API call in seconds |
| `metadata` | dict | Request context: `prompt`/`messages` and `response_format` JSON schema |
| `structured_data` | BaseModel \| None | Parsed Pydantic object (when `response_format` was set) |
| `cache_source` | str | `"miss"`, `"l1"` (local), or `"l2"` (DynamoDB) |
| `cache_key` | str \| None | Cache key for this request |

Structured Output Error Handling

When using response_format, two error conditions are raised explicitly:

Truncated output — if the provider cuts off the response before the structured output is complete, a ValueError is raised:

try:
    response = await client.generate_text(
        TextRequest(prompt="...", response_format=MyModel, max_tokens=100)
    )
except ValueError as e:
    print(e)  # "Bedrock truncated structured output (stop_reason=max_tokens)"
             # "OpenAI truncated structured output (finish_reason=length)"
             # "OpenAI truncated structured output (status=incomplete)"

Increase max_tokens to avoid this.

Provider serialization quirks — Bedrock occasionally returns list fields as JSON strings rather than inline arrays. Pydantic's model_validate is used internally to handle coercion where possible. If your model has list fields and you still see ValidationError, add a field validator:

import json
from pydantic import BaseModel, field_validator

class BookList(BaseModel):
    books: list[str]

    @field_validator("books", mode="before")
    @classmethod
    def parse_json_string(cls, v):
        if isinstance(v, str):
            return json.loads(v)
        return v

Caching

Responses are cached automatically when temperature=0, when using a reasoning model, or when extended thinking is enabled. Streaming responses (generate_text_stream) are never cached. generate_text_streamed responses are cached — they share the same cache keys as generate_text.

Cache key is derived from: model, prompt (or messages), max_tokens, top_p, system_prompt, response_format, api_type, reasoning_effort, budget_tokens.
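One way to derive a stable key from those fields is to hash a canonical JSON encoding of them. This is a sketch under assumptions: `make_cache_key`, the choice of SHA-256, and the dict-based request are all hypothetical, and values must be JSON-serializable here (a Pydantic class passed as `response_format` would first need converting, for example to its JSON schema). Only the field list comes from the text above.

```python
import hashlib
import json

# Field names copied from the list above.
CACHE_KEY_FIELDS = (
    "model", "prompt", "messages", "max_tokens", "top_p", "system_prompt",
    "response_format", "api_type", "reasoning_effort", "budget_tokens",
)


def make_cache_key(request: dict) -> str:
    """Hash the key-relevant request fields into a hex digest."""
    payload = {name: request.get(name) for name in CACHE_KEY_FIELDS}
    # sort_keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Fields outside the list do not affect the key in this sketch, so two requests that differ only in such a field map to the same entry.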

What is stored:

| Field | Description |
| --- | --- |
| `text` | Raw response text |
| `model` | Model used |
| `stop_reason` | Stop reason |
| `input_tokens` | Input token count |
| `output_tokens` | Output token count |
| `reasoning_tokens` | Reasoning token count |
| `cached_tokens` | Prompt cache token count |
| `timestamp` | ISO 8601 UTC timestamp of the original API call |
| `elapsed_seconds` | Duration of the original API call in seconds |
| `metadata.prompt` | Original prompt (or `messages`); stored in top-level cache metadata, not duplicated in data |
| `metadata.response_format` | JSON schema of requested output format |
| `structured_data` | Parsed Pydantic object (as dict) |

timestamp and elapsed_seconds are stored and restored on cache hits — they reflect when the original API call was made and how long it took.

response1 = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0))
print(response1.cache_source)  # "miss"

response2 = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0))
print(response2.cache_source)  # "l1" or "l2"

# Force refresh
response3 = await client.generate_text(TextRequest(prompt="What is 2+2?", temperature=0, clear_cache=True))
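The `cache_source` values in the example correspond to a lookup order that can be sketched with two dictionaries standing in for the local JSON cache and DynamoDB. `TwoLevelCache` is a hypothetical class; in particular, copying an L2 hit back into L1 is an assumption about typical two-level caches, not a documented SmartLLM behavior.

```python
class TwoLevelCache:
    """Look up L1 first, then the optional L2 (illustrative only)."""

    def __init__(self, l1=None, l2=None):
        self.l1 = {} if l1 is None else l1  # stands in for the local JSON cache
        self.l2 = l2                        # stands in for DynamoDB; may be None

    def get(self, key):
        """Return (value, cache_source)."""
        if key in self.l1:
            return self.l1[key], "l1"
        if self.l2 is not None and key in self.l2:
            value = self.l2[key]
            self.l1[key] = value  # assumption: promote so the next hit is local
            return value, "l2"
        return None, "miss"

    def put(self, key, value):
        self.l1[key] = value
        if self.l2 is not None:
            self.l2[key] = value
```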

Development

git clone https://github.com/Redundando/smartllm.git
cd smartllm
pip install -e ".[all,dev]"  # quoted so shells such as zsh do not expand the brackets

pytest tests/unit/ -v
pytest tests/integration/ --model gpt-4o

License

MIT — see LICENSE.
Issues: GitHub Issues


