A high-level Python SDK for Large Language Models with automatic tool execution, structured output support, and multi-agent workflows
Project description
GlueLLM
TL;DR: I got tired of writing the same LLM boilerplate over and over, so I built this. It's opinionated, has sensible defaults (for me, at least), and gets out of your way. If you like it too, awesome! If not, that's cool—there are plenty of other libraries out there.
What is this?
GlueLLM is a high-level Python SDK that makes working with Large Language Models actually pleasant. Instead of manually managing tool calls, conversations, retries, and all that jazz, you just tell it what you want and it handles the rest.
The vibe: This is what I call a "vibe-coded" library—built iteratively based on real-world needs rather than some grand architectural vision. That said, it's currently running in production projects (including my own), so it's battle-tested enough to be useful.
Why might you like it?
- 🎯 Zero ceremony - No boilerplate, no manual tool loops, just results
- 🔄 Works everywhere - One API for OpenAI, Anthropic, XAI, and others (thanks to any-llm-sdk)
- 🤖 Multi-agent workflows - 16 pre-built patterns when you need to get fancy
- 🛡️ Smart defaults - Automatic retries, proper error handling, and logging that doesn't suck
- 📊 Structured output - Type-safe responses with Pydantic (because we're not animals)
- 🔢 Embeddings support - Generate embeddings with the same error handling and observability
Why might you not?
- This isn't a low-level client—if you need fine-grained control over every API parameter, you might want something else
- It's opinionated about how LLM interactions should work
- Documentation is thorough but might reference my sense of humor occasionally
Want to help? Found a bug? Have a feature idea? Please open an issue! This is a living project that gets better when people actually use it and tell me what's broken or missing.
Features
- 🔄 Automatic Tool Execution Loop - No manual tool call handling required
- 📊 Structured Output - Type-safe responses with Pydantic models
- 💬 Conversation Management - Built-in multi-turn conversation support
- 🛠️ Multiple Tools - Easy integration of multiple tools (sync and async)
- 🎯 Simple API - Clean, intuitive interface for common LLM tasks
- 🔌 Provider Agnostic - Built on any-llm-sdk for multi-provider support
- ⚡ Automatic Retry with Exponential Backoff - Smart retry logic for rate limits and connection issues
- 🛡️ Comprehensive Error Handling - Catch and classify errors from any LLM provider
- 📝 Enhanced Logging - Track retry attempts and tool execution errors with correlation IDs
- ⚙️ Flexible Configuration - Environment-based settings with pydantic-settings
- 🤖 Multi-Agent Workflows - Orchestrate multiple agents in iterative, pipeline, and debate patterns
- 🌊 Streaming Support - Real-time response streaming for better UX
- 📈 Token Usage Tracking - Monitor token consumption and costs
- 🔍 OpenTelemetry Tracing - Distributed tracing with MLflow for observability
- 🚀 Local Batch Processing - Efficient concurrent processing with automatic retry and error handling
- 🚦 Smart Rate Limiting - Transparent rate limiting with API key pool management for batch operations
Installation
# Using uv (recommended)
uv pip install -e .
# With development dependencies
uv pip install -e ".[dev]"
# Requirements
# Python 3.12+
# See pyproject.toml for full dependencies
Quick Start
Simple Completion
import asyncio
from gluellm.api import complete
async def main():
result = await complete(
user_message="What is the capital of France?",
system_prompt="You are a helpful geography assistant.",
)
print(result.final_response)
asyncio.run(main())
Tool Execution
import asyncio
from gluellm.api import complete
def get_weather(location: str, unit: str = "celsius") -> str:
"""Get the current weather for a location.
Args:
location: City and country, e.g. "Tokyo, Japan"
unit: Temperature unit ("celsius" or "fahrenheit")
"""
return f"Weather in {location}: 22°{unit[0].upper()}, sunny"
async def main():
result = await complete(
user_message="What's the weather in Tokyo and Paris?",
system_prompt="You are a weather assistant. Use get_weather for queries.",
tools=[get_weather],
)
print(result.final_response)
print(f"Tool calls made: {result.tool_calls_made}")
asyncio.run(main())
Structured Output
import asyncio
from gluellm.api import structured_complete
from pydantic import BaseModel, Field
from typing import Annotated
class PersonInfo(BaseModel):
name: Annotated[str, Field(description="Full name")]
age: Annotated[int, Field(description="Age in years")]
city: Annotated[str, Field(description="City of residence")]
async def main():
person = await structured_complete(
user_message="Extract info: John Smith, 35, lives in Seattle",
response_format=PersonInfo,
)
print(f"{person.name} is {person.age} years old and lives in {person.city}")
asyncio.run(main())
Provider Support via any-llm-sdk
GlueLLM is built on top of any-llm-sdk, which provides a unified interface to multiple LLM providers. This means you can easily switch between providers without changing your code.
Supported Providers
- OpenAI - GPT-4, GPT-4o, GPT-3.5, etc.
- Anthropic - Claude 3.5 Sonnet, Claude 3 Opus, etc.
- XAI - Grok models
- And more - any-llm-sdk supports multiple providers
Model Specification Format
Models are specified using the format: provider:model_name
# OpenAI models
model="openai:gpt-4o-mini"
model="openai:gpt-4o"
model="openai:gpt-4-turbo"
# Anthropic models
model="anthropic:claude-3-5-sonnet-20241022"
model="anthropic:claude-3-opus-20240229"
# XAI models
model="xai:grok-beta"
API Key Configuration
Each provider requires its own API key. Set them via environment variables:
# OpenAI
export OPENAI_API_KEY=sk-...
# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
# XAI
export XAI_API_KEY=xai-...
Alternatively, use the GLUELLM_ prefixed variables:
export GLUELLM_OPENAI_API_KEY=sk-...
export GLUELLM_ANTHROPIC_API_KEY=sk-ant-...
export GLUELLM_XAI_API_KEY=xai-...
Switching Providers
Switching between providers is as simple as changing the model string:
# Use OpenAI
result = await complete(
user_message="Hello!",
model="openai:gpt-4o-mini",
)
# Switch to Anthropic
result = await complete(
user_message="Hello!",
model="anthropic:claude-3-5-sonnet-20241022",
)
# Switch to XAI
result = await complete(
user_message="Hello!",
model="xai:grok-beta",
)
All features (tool calling, structured output, conversations) work consistently across providers thanks to any-llm-sdk's unified interface.
Default Provider
Set a default provider in your configuration to avoid specifying it every time:
# .env file
GLUELLM_DEFAULT_MODEL=openai:gpt-4o-mini
from gluellm.api import complete
# Uses default model from config
result = await complete("Hello!")
Why any-llm-sdk?
- Unified API - Same code works with any provider
- Feature Parity - Consistent tool calling and structured output
- Easy Migration - Switch providers without code changes
- Future-Proof - New providers added to any-llm-sdk work automatically
For more details on any-llm-sdk capabilities and supported providers, see the any-llm-sdk documentation.
Core API
GlueLLM Client
The main client class for stateful LLM interactions with conversation memory.
from gluellm.api import GlueLLM
client = GlueLLM(
model="openai:gpt-4o-mini", # Model in "provider:model" format
system_prompt="You are helpful.", # System prompt (optional)
tools=[my_tool], # List of tools (optional)
max_tool_iterations=10, # Max tool execution loops (optional)
)
# Make completions with conversation memory
result = await client.complete("My favorite color is blue")
result = await client.complete("What's my favorite color?") # Remembers context
# Structured completions
data = await client.structured_complete("Extract data...", response_format=MyModel)
# Reset conversation history
client.reset_conversation()
Convenience Functions
For one-off requests without maintaining conversation state:
from gluellm.api import complete, structured_complete, embed
# Simple completion
result = await complete(
user_message="What is 2+2?",
model="openai:gpt-4o-mini", # Optional, defaults to config
system_prompt="You are helpful.", # Optional
tools=[calculator], # Optional
execute_tools=True, # Optional, default True
max_tool_iterations=10, # Optional
)
# Structured completion
person = await structured_complete(
user_message="Extract: Jane Doe, age 28",
response_format=PersonInfo,
model="openai:gpt-4o-mini", # Optional
system_prompt="Extract data.", # Optional
)
ToolExecutionResult
The result object returned by complete() and client.complete():
class ToolExecutionResult:
final_response: str # Final text response from model
tool_calls_made: int # Number of tool calls executed
tool_execution_history: list[dict] # Detailed tool call history
raw_response: ChatCompletion | None # Raw response from any-llm-sdk
tokens_used: dict[str, int] | None # Token usage info (prompt, completion, total)
# Access result data
result = await complete("Query with tools...", tools=[my_tool])
print(result.final_response)
print(f"Made {result.tool_calls_made} tool calls")
# Token usage tracking
if result.tokens_used:
print(f"Tokens used: {result.tokens_used['total']}")
print(f" Prompt: {result.tokens_used['prompt']}")
print(f" Completion: {result.tokens_used['completion']}")
for exec_info in result.tool_execution_history:
print(f"{exec_info['tool_name']}({exec_info['arguments']}) -> {exec_info['result']}")
Tool Execution
How It Works
GlueLLM automatically handles the tool execution loop:
1. Send the user message to the LLM
2. If the LLM wants to call a tool, execute it
3. Send the tool result back to the LLM
4. Repeat until the LLM provides a final answer or max iterations is reached
No manual tool call management required!
sequenceDiagram
participant User
participant GlueLLM
participant LLM
participant Tool
User->>GlueLLM: complete(message, tools)
GlueLLM->>LLM: Send message
LLM->>GlueLLM: Tool call request
GlueLLM->>Tool: Execute tool
Tool->>GlueLLM: Return result
GlueLLM->>LLM: Send tool result
LLM->>GlueLLM: Final response
GlueLLM->>User: ToolExecutionResult
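In code, the loop boils down to something like this minimal sketch (illustrative only, not GlueLLM's actual implementation; call_llm and the tools dict are hypothetical stand-ins for the provider call and tool registry):

# Minimal sketch of an automatic tool loop (illustrative, not GlueLLM internals).
# call_llm and tools are hypothetical stand-ins.
async def tool_loop(messages: list[dict], tools: dict, call_llm, max_iterations: int = 10) -> str:
    for _ in range(max_iterations):
        response = await call_llm(messages)
        tool_calls = response.get("tool_calls") or []
        if not tool_calls:
            return response["content"]  # No tool requests left: this is the final answer
        for call in tool_calls:
            result = tools[call["name"]](**call["arguments"])  # Execute the requested tool
            messages.append({"role": "tool", "content": str(result)})
    return "Stopped: max tool iterations reached"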
Creating Tools
Tools are simple Python functions with docstrings:
def calculate(expression: str) -> str:
"""Safely evaluate a mathematical expression.
Args:
expression: A mathematical expression like "2 + 2" or "10 * 5"
Returns:
The result of the calculation
"""
try:
result = eval(expression)  # Caution: eval is unsafe; use a proper expression parser in production
return str(result)
except Exception as e:
return f"Error: {str(e)}"
Important:
- Function name becomes the tool name
- Docstring is used by the LLM to understand tool purpose
- Type hints help the LLM understand parameters
- Return string values for best results
- Tools can be synchronous or asynchronous (use async def for async tools)
Multiple Tools Example
def get_weather(location: str) -> str:
"""Get current weather for a location."""
return f"Weather in {location}: 22°C, sunny"
def get_time(timezone: str = "UTC") -> str:
    """Get current time in a timezone."""
    from datetime import datetime
    # Simplified example: returns local time regardless of the timezone argument
    return datetime.now().strftime("%H:%M:%S")
result = await complete(
user_message="What's the weather in Tokyo and what time is it there?",
tools=[get_weather, get_time],
)
Async Tools
Tools can be asynchronous for database queries, API calls, etc.:
import aiohttp
async def fetch_data(url: str) -> str:
"""Fetch data from a URL."""
async with aiohttp.ClientSession() as session:
async with session.get(url) as response:
return await response.text()
result = await complete(
user_message="Fetch data from https://api.example.com/data",
tools=[fetch_data],
)
GlueLLM automatically detects async tools and awaits them appropriately.
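Detecting async tools needs nothing exotic; the standard library can do it, roughly like this (a sketch of the idea, not necessarily GlueLLM's internal code):

import inspect

# Sketch: await coroutine functions, call plain functions directly
async def run_tool(tool, **kwargs):
    if inspect.iscoroutinefunction(tool):
        return await tool(**kwargs)
    return tool(**kwargs)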
Tool Execution History
Track all tool calls and their results:
result = await complete("Complex query requiring tools...", tools=[tool1, tool2])
for i, exec_info in enumerate(result.tool_execution_history, 1):
print(f"\nTool Call {i}:")
print(f" Tool: {exec_info['tool_name']}")
print(f" Args: {exec_info['arguments']}")
print(f" Result: {exec_info['result']}")
print(f" Error: {exec_info.get('error', False)}")
Embeddings
Generate embeddings for text with automatic error handling, rate limiting, and cost tracking.
Basic Embedding Generation
import asyncio
from gluellm import embed
async def main():
# Single text
result = await embed("Hello, world!")
print(f"Dimension: {result.dimension}") # e.g., 1536 for text-embedding-3-small
print(f"Tokens: {result.tokens_used}")
print(f"Cost: ${result.estimated_cost_usd:.6f}")
# Access the embedding vector
embedding_vector = result.embeddings[0]
print(f"First 5 values: {embedding_vector[:5]}")
asyncio.run(main())
Batch Embeddings
import asyncio
from gluellm import embed
async def main():
# Multiple texts at once
texts = ["Hello", "World", "Python programming"]
result = await embed(texts)
print(f"Generated {result.count} embeddings")
print(f"Each has {result.dimension} dimensions")
# Access individual embeddings
for i, text in enumerate(texts):
embedding = result.get_embedding(i)
print(f"{text}: {embedding[:3]}...") # First 3 dimensions
asyncio.run(main())
Using GlueLLM Client
import asyncio
from gluellm import GlueLLM
async def main():
# Create client with custom embedding model
client = GlueLLM(
embedding_model="openai/text-embedding-3-large", # Optional
)
# Generate embeddings
result = await client.embed("Your text here")
print(f"Dimension: {result.dimension}")
# Use same client for completions
completion = await client.complete("Summarize this")
print(completion.final_response)
asyncio.run(main())
EmbeddingResult Model
The EmbeddingResult provides convenient access to embedding data:
from gluellm import embed
result = await embed("Hello")
# Properties
result.embeddings # List of embedding vectors
result.model # Model used (e.g., "openai/text-embedding-3-small")
result.tokens_used # Total tokens consumed
result.estimated_cost_usd # Estimated cost in USD
result.dimension # Dimension of embedding vectors
result.count # Number of embeddings
# Methods
embedding = result.get_embedding(0) # Get embedding by index
Custom Embedding Model
from gluellm import embed
# Use a different embedding model
result = await embed(
"Your text",
model="openai/text-embedding-3-large", # 3072 dimensions
)
print(f"Dimension: {result.dimension}") # 3072
Configuration
Set default embedding model via environment variable:
export GLUELLM_DEFAULT_EMBEDDING_MODEL=openai/text-embedding-3-small
Or in code:
from gluellm.config import settings
print(settings.default_embedding_model) # "openai/text-embedding-3-small"
Supported Embedding Models
Currently supported embedding models:
- openai/text-embedding-3-small (1536 dimensions, $0.02/1M tokens)
- openai/text-embedding-3-large (3072 dimensions, $0.13/1M tokens)
- openai/text-embedding-ada-002 (1536 dimensions, $0.10/1M tokens)
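Cost estimation is just arithmetic over these rates; for example, embedding 10,000 tokens with text-embedding-3-small:

# cost = tokens / 1_000_000 * price_per_million_tokens
tokens = 10_000
price_per_million = 0.02  # USD for text-embedding-3-small (see list above)
print(f"${tokens / 1_000_000 * price_per_million:.6f}")  # $0.000200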
Error Handling
Embeddings use the same error handling as completions:
from gluellm import embed, RateLimitError, TokenLimitError
try:
result = await embed("Your text")
except RateLimitError:
print("Rate limit hit - will retry automatically")
except TokenLimitError:
print("Text too long - reduce input size")
Structured Output
Get type-safe, validated responses using Pydantic models:
from pydantic import BaseModel, Field
from typing import Annotated
class Product(BaseModel):
name: Annotated[str, Field(description="Product name")]
price: Annotated[float, Field(description="Price in USD")]
in_stock: Annotated[bool, Field(description="Whether item is in stock")]
category: Annotated[str, Field(description="Product category")]
product = await structured_complete(
user_message="Extract: iPhone 15 Pro, $999, available, Electronics",
response_format=Product,
)
# Type-safe access
print(f"{product.name}: ${product.price}")
print(f"In stock: {product.in_stock}")
Complex Structured Output
from typing import List
class Address(BaseModel):
street: str
city: str
country: str
class Person(BaseModel):
name: str
age: int
addresses: List[Address]
occupation: str | None = None
person = await structured_complete(
user_message="""
Extract: Sarah Johnson, 32 years old.
Lives at 123 Main St, New York, USA and 456 Oak Ave, Boston, USA.
Works as a software engineer.
""",
response_format=Person,
)
Batch Processing
Process multiple LLM requests efficiently with controlled concurrency, automatic retry, comprehensive error handling, and intelligent API key rotation.
Note: For high-volume batch operations, see the Rate Limiting section to learn about API key pools that automatically manage rate limits across multiple keys.
Simple Batch Processing
Process a list of messages with automatic concurrency control:
import asyncio
from gluellm import batch_complete_simple, BatchConfig
async def main():
messages = [
"What is 2+2?",
"What is the capital of France?",
"Explain quantum computing in one sentence.",
]
# Process with concurrency limit of 3
responses = await batch_complete_simple(
messages,
config=BatchConfig(max_concurrent=3),
)
for msg, resp in zip(messages, responses, strict=True):
print(f"Q: {msg}")
print(f"A: {resp}\n")
asyncio.run(main())
Advanced Batch Processing
Use BatchRequest for fine-grained control over each request:
from gluellm import batch_complete, BatchRequest, BatchConfig, BatchErrorStrategy
requests = [
BatchRequest(
id="math-1",
user_message="Calculate 15 * 24",
metadata={"category": "math"},
),
BatchRequest(
id="geography-1",
user_message="What is the capital of Japan?",
timeout=30.0,
metadata={"category": "geography"},
),
BatchRequest(
id="code-1",
user_message="Write a Python function to reverse a string",
model="openai:gpt-4o", # Per-request model override
tools=[my_tool], # Per-request tools
metadata={"category": "coding"},
),
]
response = await batch_complete(
requests,
config=BatchConfig(
max_concurrent=5,
error_strategy=BatchErrorStrategy.CONTINUE, # Continue on errors
retry_failed=True, # Automatically retry failed requests
),
)
# Access results
print(f"Completed: {response.successful_requests}/{response.total_requests}")
print(f"Failed: {response.failed_requests}")
print(f"Total time: {response.total_elapsed_time:.2f}s")
for result in response.results:
if result.success:
print(f"{result.id}: {result.response}")
else:
print(f"{result.id}: Error - {result.error}")
Batch Processing Features
Concurrency Control:
- Limit concurrent requests to avoid rate limits
- Automatic queuing and execution management
- Configurable via the max_concurrent parameter
Error Handling Strategies:
- FAIL_FAST: Stop on first error
- CONTINUE: Process all requests despite errors
- COLLECT: Gather all errors and raise at the end
Automatic Retry:
- Optional retry of failed requests
- Preserves request metadata including retry status
- Configurable retry behavior
Per-Request Configuration:
- Custom timeout per request
- Model override per request
- Tools override per request
- Custom system prompt per request (see the example after this list)
- Request-specific metadata
Response Tracking:
- Individual request IDs for correlation
- Success/failure status per request
- Detailed error messages
- Elapsed time per request
- Token usage tracking
- Tool call statistics
API Key Pool Management:
- Automatic rotation across multiple API keys
- Per-key rate limiting
- Provider-specific key management
- Maximizes throughput for high-volume operations
- See Rate Limiting for details
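The one per-request option not shown in the examples above is the custom system prompt; assuming the BatchRequest field is named system_prompt (check gluellm.models.batch for the actual definition), it would look like:

# Assumes BatchRequest exposes a system_prompt field; verify against gluellm.models.batch
BatchRequest(
    user_message="Summarize this contract clause",
    system_prompt="You are a legal assistant. Be precise and quote the clause.",
)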
Batch Processing with Tools
def calculate(expression: str) -> str:
"""Evaluate a mathematical expression."""
    return str(eval(expression))  # Demo only: eval is unsafe in production
requests = [
BatchRequest(
user_message="Calculate 25 * 4",
tools=[calculate],
execute_tools=True,
),
BatchRequest(
user_message="What is 100 divided by 5?",
tools=[calculate],
execute_tools=True,
),
]
response = await batch_complete(requests)
Error Handling in Batches
from gluellm import BatchErrorStrategy
# Stop on first error
config = BatchConfig(error_strategy=BatchErrorStrategy.FAIL_FAST)
# Continue processing all requests
config = BatchConfig(error_strategy=BatchErrorStrategy.CONTINUE)
# Process all then raise exception with all errors
config = BatchConfig(error_strategy=BatchErrorStrategy.COLLECT)
response = await batch_complete(requests, config=config)
# Check for failures
if response.failed_requests > 0:
for result in response.results:
if not result.success:
print(f"Request {result.id} failed: {result.error}")
Batch Response
The BatchResponse object provides comprehensive batch operation statistics:
class BatchResponse:
results: list[BatchResult] # All individual results
total_requests: int # Total requests processed
successful_requests: int # Number of successful requests
failed_requests: int # Number of failed requests
total_elapsed_time: float # Total batch processing time
total_tokens_used: dict[str, int] | None # Aggregated token usage
# Access statistics
print(f"Success rate: {response.successful_requests / response.total_requests * 100:.1f}%")
if response.total_tokens_used:
print(f"Total tokens: {response.total_tokens_used['total']}")
Batch Processing with Multiple API Keys
For high-volume batch operations, you can provide a pool of API keys that GlueLLM will automatically rotate through, managing rate limits on a per-key basis:
import asyncio
from gluellm import batch_complete, BatchRequest, BatchConfig
from gluellm.models.batch import APIKeyConfig
async def main():
# Configure multiple API keys with optional per-key rate limits
api_keys = [
APIKeyConfig(
key="sk-proj-key1...",
provider="openai",
requests_per_minute=60, # Optional: custom limit for this key
burst=10, # Optional: burst capacity for this key
),
APIKeyConfig(
key="sk-proj-key2...",
provider="openai",
requests_per_minute=60,
),
APIKeyConfig(
key="sk-proj-key3...",
provider="openai",
),
# You can also include keys for different providers
APIKeyConfig(
key="sk-ant-...",
provider="anthropic",
requests_per_minute=50,
),
]
# Create batch requests
requests = [
BatchRequest(
user_message=f"Analyze document {i}",
model="openai:gpt-4o-mini",
)
for i in range(200)
]
# Process with automatic key rotation
response = await batch_complete(
requests,
config=BatchConfig(
max_concurrent=10,
api_keys=api_keys, # Keys automatically rotated based on rate limits
),
)
print(f"Completed: {response.successful_requests}/{response.total_requests}")
asyncio.run(main())
How it works (a conceptual sketch follows this list):
- Each API key has its own rate limiter
- When a key hits its rate limit, GlueLLM automatically switches to the next available key
- If all keys are rate-limited, the system waits for the next available key
- Keys are balanced to distribute load evenly
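Conceptually, the rotation can be sketched in a few lines (illustrative only; GlueLLM's actual implementation may differ):

import time
from dataclasses import dataclass, field

@dataclass
class KeySlot:
    key: str
    rpm: int
    recent: list[float] = field(default_factory=list)  # Timestamps of recent requests

def next_available_key(slots: list[KeySlot]) -> str | None:
    """Return the first key with spare capacity in its 60-second window, else None."""
    now = time.monotonic()
    for slot in slots:
        slot.recent = [t for t in slot.recent if now - t < 60]  # Evict old requests
        if len(slot.recent) < slot.rpm:
            slot.recent.append(now)
            return slot.key
    return None  # All keys exhausted: wait for the earliest window to free up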
Benefits:
- Higher Throughput: Process more requests per minute by combining limits across keys
- Automatic Management: No manual key rotation or rate limit tracking needed
- Per-Key Limits: Set custom rate limits for each key based on your tier
- Multi-Provider: Mix keys from different providers in the same pool
When to use:
- Processing hundreds or thousands of requests
- Need throughput beyond a single API key's rate limit
- Want to maximize parallel processing
- Production workloads requiring high reliability
For more details on rate limiting configuration, see the Rate Limiting section.
Example Use Cases
1. Content Generation at Scale:
# Generate multiple article variations
topics = ["AI", "Blockchain", "Quantum Computing"]
requests = [
BatchRequest(user_message=f"Write a brief introduction about {topic}")
for topic in topics
]
2. Data Extraction:
# Extract structured data from multiple sources
documents = load_documents()
requests = [
BatchRequest(
user_message=f"Extract key entities from: {doc}",
metadata={"document_id": doc.id},
)
for doc in documents
]
3. A/B Testing:
# Test different models on same prompts
models = ["openai:gpt-4o-mini", "anthropic:claude-3-5-sonnet-20241022"]
prompts = ["Question 1", "Question 2"]
requests = [
BatchRequest(user_message=prompt, model=model, id=f"{model}-{i}")
for model in models
for i, prompt in enumerate(prompts)
]
4. High-Volume Processing with API Key Pool:
from gluellm.models.batch import APIKeyConfig
# Define multiple API keys for automatic rotation
api_keys = [
APIKeyConfig(key="sk-key1...", provider="openai", requests_per_minute=60),
APIKeyConfig(key="sk-key2...", provider="openai", requests_per_minute=60),
APIKeyConfig(key="sk-key3...", provider="openai", requests_per_minute=60),
]
# Process large datasets efficiently with automatic key rotation
requests = [BatchRequest(user_message=item) for item in large_dataset]
config = BatchConfig(
max_concurrent=10, # Process 10 at a time
error_strategy=BatchErrorStrategy.CONTINUE, # Don't stop on errors
retry_failed=True, # Retry failed requests once
api_keys=api_keys, # GlueLLM automatically manages rate limits per key
)
response = await batch_complete(requests, config=config)
5. Single API Key for Batch:
# If you just need one key (falls back to environment variables if not provided)
config = BatchConfig(
max_concurrent=5,
error_strategy=BatchErrorStrategy.CONTINUE,
)
response = await batch_complete(requests, config=config)
Performance Tips
- Tune Concurrency: Balance speed vs rate limits
# Conservative (avoid rate limits)
config = BatchConfig(max_concurrent=3)
# Aggressive (faster, may hit rate limits)
config = BatchConfig(max_concurrent=10)
- Use Request IDs: Track requests for logging and debugging
BatchRequest(id=f"request-{uuid.uuid4()}", ...)
- Set Timeouts: Prevent slow requests from blocking the batch
BatchRequest(timeout=30.0, ...)  # 30 second timeout
- Monitor Token Usage: Track costs across the batch
if response.total_tokens_used:
    cost = calculate_cost(response.total_tokens_used)
    print(f"Batch cost: ${cost:.4f}")
- Use API Key Pools for High Throughput: Rotate multiple keys automatically
from gluellm.models.batch import APIKeyConfig
# Multiple keys = higher total throughput
api_keys = [
APIKeyConfig(key="sk-key1...", provider="openai", requests_per_minute=60),
APIKeyConfig(key="sk-key2...", provider="openai", requests_per_minute=60),
]
config = BatchConfig(api_keys=api_keys, max_concurrent=10)
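Note that calculate_cost in the token-usage tip above is a placeholder, not a GlueLLM function; a minimal version might look like this (prices here are made up; check your provider's pricing page):

# Hypothetical per-million-token prices in USD; substitute your model's real rates
PRICES = {"prompt": 0.15, "completion": 0.60}

def calculate_cost(tokens_used: dict[str, int]) -> float:
    return (
        tokens_used.get("prompt", 0) / 1_000_000 * PRICES["prompt"]
        + tokens_used.get("completion", 0) / 1_000_000 * PRICES["completion"]
    )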
For more examples, see examples/batch_processing.py.
Rate Limiting
GlueLLM includes smart, transparent rate limiting powered by throttled-py to prevent accidental API abuse and manage request rates efficiently.
Features
- Automatic Rate Limiting - Transparent rate limiting that waits when limits are hit (no manual handling required)
- Configurable Limits - Set global or per-provider rate limits via configuration
- Multiple Backends - In-memory (default) or Redis for distributed rate limiting
- API Key Pool Management - Automatically rotate through multiple API keys with per-key rate limiting
- Sliding Window Algorithm - Smooth rate limiting without sudden drops at window boundaries
Basic Rate Limiting
Rate limiting is enabled by default and works transparently:
import asyncio
from gluellm.api import complete
async def main():
# Rate limiting happens automatically
for i in range(100):
result = await complete(f"Question {i}")
print(f"{i}: {result.final_response}")
# If rate limit is hit, GlueLLM automatically waits and retries
asyncio.run(main())
Configuration
Configure rate limiting via environment variables:
# Enable/disable rate limiting
export GLUELLM_RATE_LIMIT_ENABLED=true
# Set limits
export GLUELLM_RATE_LIMIT_REQUESTS_PER_MINUTE=60 # Default: 60 RPM
export GLUELLM_RATE_LIMIT_BURST=10 # Burst capacity (default: 10)
# Choose backend
export GLUELLM_RATE_LIMIT_BACKEND=memory # "memory" (default) or "redis"
# Redis backend (optional)
export GLUELLM_RATE_LIMIT_REDIS_URL=redis://localhost:6379
Or configure programmatically:
from gluellm.config import settings
# Access current rate limit settings
print(settings.rate_limit_enabled) # True
print(settings.rate_limit_requests_per_minute) # 60
print(settings.rate_limit_burst) # 10
Disabling Rate Limiting
To disable rate limiting entirely:
export GLUELLM_RATE_LIMIT_ENABLED=false
Or in code:
from gluellm.config import settings
settings.rate_limit_enabled = False
API Key Pool for Batch Operations
For high-volume batch processing, use an API key pool to automatically rotate through multiple keys with per-key rate limiting:
import asyncio
from gluellm import batch_complete, BatchRequest, BatchConfig
from gluellm.models.batch import APIKeyConfig
async def main():
# Define API key pool with per-key rate limits
api_keys = [
APIKeyConfig(
key="sk-key1...",
provider="openai",
requests_per_minute=60, # Optional per-key limit
burst=10, # Optional per-key burst
),
APIKeyConfig(
key="sk-key2...",
provider="openai",
requests_per_minute=60,
),
APIKeyConfig(
key="sk-ant-key1...",
provider="anthropic",
requests_per_minute=50,
),
]
# Create batch requests
requests = [
BatchRequest(user_message=f"Question {i}", model="openai:gpt-4o-mini")
for i in range(200)
]
# Process with API key pool
response = await batch_complete(
requests,
config=BatchConfig(
max_concurrent=10,
api_keys=api_keys, # GlueLLM automatically manages key rotation
),
)
print(f"Completed: {response.successful_requests}/{response.total_requests}")
asyncio.run(main())
How it works:
- GlueLLM maintains separate rate limiters for each API key
- When a key hits its rate limit, the system automatically rotates to the next available key
- If all keys are rate-limited, the system waits for the next available key
- Keys are distributed evenly to balance load
Advanced: Custom Rate Limiters
For fine-grained control, use the rate limiter API directly:
from gluellm.rate_limiter import get_rate_limiter, acquire_rate_limit
# Create a custom rate limiter
rate_limiter = get_rate_limiter(
key="my_custom_limiter",
requests_per_minute=30,
burst=5,
backend="memory",
)
# Manually acquire rate limit permit
async def rate_limited_operation():
await acquire_rate_limit("my_custom_limiter", rate_limiter=rate_limiter)
# Your operation here
print("Operation executed")
Redis Backend for Distributed Systems
For distributed applications, use Redis to share rate limits across instances:
# Set Redis backend
export GLUELLM_RATE_LIMIT_BACKEND=redis
export GLUELLM_RATE_LIMIT_REDIS_URL=redis://localhost:6379
from gluellm.rate_limiter import get_rate_limiter
# Rate limiter state is shared across all instances
rate_limiter = get_rate_limiter(
key="shared_limiter",
requests_per_minute=100,
backend="redis",
redis_url="redis://localhost:6379",
)
Benefits of Redis backend:
- Shared rate limits across multiple application instances
- Persistent rate limit state across restarts
- Centralized rate limit management
- Ideal for microservices and distributed systems
API Key Pool Configuration
The APIKeyConfig model allows per-key customization:
from gluellm.models.batch import APIKeyConfig
# Basic key configuration
key_config = APIKeyConfig(
key="sk-...",
provider="openai",
)
# With custom rate limits
key_config = APIKeyConfig(
key="sk-...",
provider="openai",
requests_per_minute=100, # Custom RPM for this key
burst=20, # Custom burst capacity
)
# Multiple providers
key_configs = [
APIKeyConfig(key="sk-...", provider="openai", requests_per_minute=60),
APIKeyConfig(key="sk-ant-...", provider="anthropic", requests_per_minute=50),
APIKeyConfig(key="xai-...", provider="xai", requests_per_minute=40),
]
Rate Limiting Strategies
GlueLLM uses the Sliding Window algorithm (sketched after this list), which provides:
- Smooth rate limiting - No sudden resets at window boundaries
- Fair distribution - Requests are evenly distributed over time
- Burst handling - Allows short bursts up to the burst capacity
- Accurate limiting - Precise rate enforcement without over-limiting
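To make the algorithm concrete, here is a self-contained sketch of sliding-window limiting (illustrative only; GlueLLM delegates the real work to throttled-py):

import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` events per `window` seconds (illustrative sketch)."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.events: deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()  # Evict events that have left the window
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False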
Performance Considerations
- Memory Backend: Fast, suitable for single-instance applications
# Default - no setup required
export GLUELLM_RATE_LIMIT_BACKEND=memory
- Redis Backend: Adds network latency but enables distributed rate limiting
# Requires a Redis server
export GLUELLM_RATE_LIMIT_BACKEND=redis
export GLUELLM_RATE_LIMIT_REDIS_URL=redis://localhost:6379
- API Key Pool: Maximizes throughput by parallelizing across keys
# Use multiple keys for high-volume workloads
config = BatchConfig(api_keys=[key1, key2, key3])
Best Practices
- Set Conservative Limits: Start with lower limits and increase as needed
export GLUELLM_RATE_LIMIT_REQUESTS_PER_MINUTE=30
- Use API Key Pools for Batch: Distribute load across multiple keys
config = BatchConfig(api_keys=multiple_keys, max_concurrent=10)
- Monitor Rate Limit Hits: Enable debug logging to track rate limiting
export GLUELLM_LOG_LEVEL=DEBUG
- Match Provider Limits: Set limits below your provider's actual limits
# If OpenAI allows 3500 RPM, set 3000 RPM for a safety margin
APIKeyConfig(provider="openai", requests_per_minute=3000)
- Use Redis for Production: For multi-instance deployments, use Redis
export GLUELLM_RATE_LIMIT_BACKEND=redis
Troubleshooting
Rate limits still being hit:
- Check your configured limits against provider limits
- Reduce requests_per_minute or max_concurrent settings
- Add more API keys to your pool
Slow batch processing:
- Rate limiting is working correctly - this prevents API errors
- Increase requests_per_minute if you have higher provider limits
- Add more API keys to increase throughput
Redis connection errors:
- Verify the Redis server is running: redis-cli ping
- Check that the Redis URL is correct
- Ensure network connectivity to Redis server
Multi-turn Conversations
Use the GlueLLM client to maintain conversation context across multiple turns:
from gluellm.api import GlueLLM
# Create client with conversation memory
client = GlueLLM(
model="openai:gpt-4o-mini",
system_prompt="You are a helpful assistant with memory.",
tools=[calculator], # Tools available across conversation
)
# Turn 1
result1 = await client.complete("My favorite number is 42")
print(result1.final_response) # "I'll remember that your favorite number is 42!"
# Turn 2 - Has context from turn 1
result2 = await client.complete("What's my favorite number?")
print(result2.final_response) # "Your favorite number is 42."
# Turn 3 - Can use tools with context
result3 = await client.complete("Calculate my favorite number times 2")
print(result3.final_response) # "42 × 2 = 84"
# Reset conversation to start fresh
client.reset_conversation()
Configuration
GlueLLM uses pydantic-settings for flexible configuration via environment variables or .env files.
Configuration Options
from gluellm.config import settings
# Access current settings
print(settings.default_model) # "openai:gpt-4o-mini"
print(settings.max_tool_iterations) # 10
print(settings.retry_max_attempts) # 3
Environment Variables
All settings can be configured via environment variables with the GLUELLM_ prefix:
# Model settings
export GLUELLM_DEFAULT_MODEL=openai:gpt-4o-mini
export GLUELLM_DEFAULT_EMBEDDING_MODEL=openai/text-embedding-3-small
export GLUELLM_DEFAULT_SYSTEM_PROMPT="You are a helpful assistant."
# Tool execution
export GLUELLM_MAX_TOOL_ITERATIONS=10
# Retry configuration
export GLUELLM_RETRY_MAX_ATTEMPTS=3
export GLUELLM_RETRY_MIN_WAIT=2
export GLUELLM_RETRY_MAX_WAIT=30
export GLUELLM_RETRY_MULTIPLIER=1
# Rate limiting
export GLUELLM_RATE_LIMIT_ENABLED=true # Enable/disable rate limiting
export GLUELLM_RATE_LIMIT_REQUESTS_PER_MINUTE=60 # Default requests per minute
export GLUELLM_RATE_LIMIT_BURST=10 # Burst capacity
export GLUELLM_RATE_LIMIT_BACKEND=memory # Backend: "memory" or "redis"
export GLUELLM_RATE_LIMIT_REDIS_URL=redis://localhost:6379 # Redis URL (if using redis backend)
# Logging
export GLUELLM_LOG_LEVEL=INFO # Console log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export GLUELLM_LOG_FILE_LEVEL=DEBUG # File log level (typically DEBUG for full details)
export GLUELLM_LOG_DIR=logs # Log directory (default: logs/)
export GLUELLM_LOG_FILE_NAME=gluellm.log # Log file name
export GLUELLM_LOG_JSON_FORMAT=false # Enable JSON structured logging
export GLUELLM_LOG_MAX_BYTES=10485760 # Max log file size before rotation (10MB)
export GLUELLM_LOG_BACKUP_COUNT=5 # Number of backup log files to keep
# API Keys (optional - can also use provider-specific vars)
export GLUELLM_OPENAI_API_KEY=your-key
export GLUELLM_ANTHROPIC_API_KEY=your-key
export GLUELLM_XAI_API_KEY=your-key
Using .env File
Create a .env file in your project root (see env.example):
# .env
GLUELLM_DEFAULT_MODEL=openai:gpt-4o-mini
GLUELLM_DEFAULT_SYSTEM_PROMPT=You are a helpful assistant.
GLUELLM_MAX_TOOL_ITERATIONS=10
GLUELLM_RETRY_MAX_ATTEMPTS=3
GLUELLM_LOG_LEVEL=INFO
Reloading Configuration
from gluellm.config import reload_settings
# Reload after changing .env file
settings = reload_settings()
OpenTelemetry Tracing with MLflow
GlueLLM supports distributed tracing and metrics tracking using OpenTelemetry and MLflow for comprehensive observability of LLM interactions. This feature enables you to:
- Monitor LLM calls with detailed span information
- Track token usage and costs
- Trace tool executions and their results
- Debug complex workflows with visualization
- Analyze performance across multiple calls
- Log metrics automatically - Latency, token usage, success/error rates from any-llm clients
Prerequisites
- Install MLflow and OpenTelemetry dependencies:
pip install mlflow>=3.6.0 opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
- Start MLflow tracking server:
# Start with SQLite backend (required for OTLP ingestion)
mlflow server --backend-store-uri sqlite:///mlflow.db --port 5000
Note: File-based backends do not support OpenTelemetry Protocol (OTLP) ingestion. Use a SQL-based backend (SQLite, PostgreSQL, MySQL).
Configuration
Enable tracing by setting environment variables:
# Enable tracing
export GLUELLM_ENABLE_TRACING=true
# MLflow tracking server
export GLUELLM_MLFLOW_TRACKING_URI=http://localhost:5000
export GLUELLM_MLFLOW_EXPERIMENT_NAME=gluellm
# OpenTelemetry OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:5000/v1/traces
Or use a .env file:
# .env
GLUELLM_ENABLE_TRACING=true
GLUELLM_MLFLOW_TRACKING_URI=http://localhost:5000
GLUELLM_MLFLOW_EXPERIMENT_NAME=gluellm
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:5000/v1/traces
Usage
Once configured, tracing is automatic:
import asyncio
from gluellm.api import complete
from gluellm.telemetry import configure_tracing
async def main():
# Configure tracing on startup
configure_tracing()
# All LLM calls will be automatically traced
result = await complete("What is the capital of France?")
print(result.final_response)
asyncio.run(main())
Viewing Traces
- Open the MLflow UI: http://localhost:5000
- Navigate to the "Traces" tab
- Click on a trace to see detailed execution information:
- LLM call parameters (model, messages count)
- Token usage (prompt, completion, total)
- Tool executions and results
- Error information if any
- Execution duration
Traced Attributes
GlueLLM automatically captures the following trace attributes:
LLM Call Spans:
- llm.provider - Provider name (openai, anthropic, etc.)
- llm.model - Model name
- llm.messages_count - Number of messages in the request
- llm.tools_available - Number of tools available
- llm.tokens.prompt - Prompt tokens used
- llm.tokens.completion - Completion tokens used
- llm.tokens.total - Total tokens used
- llm.response.finish_reason - Finish reason (stop, tool_calls, etc.)
- llm.response.has_tool_calls - Whether the response includes tool calls
Tool Execution Spans:
- tool.name - Name of the executed tool
- tool.arg_count - Number of arguments passed
- tool.success - Whether execution succeeded
- tool.error - Error message if execution failed
Example with Tool Tracing
import asyncio
from gluellm.api import complete
from gluellm.telemetry import configure_tracing
def get_weather(location: str) -> str:
"""Get weather for a location."""
return f"Weather in {location}: Sunny, 72°F"
async def main():
configure_tracing()
result = await complete(
"What's the weather in San Francisco?",
tools=[get_weather]
)
print(result.final_response)
# Check MLflow UI to see both LLM call and tool execution traces
asyncio.run(main())
Advanced: Custom Span Attributes
For custom instrumentation, use the tracing context:
from gluellm.telemetry import trace_llm_call, set_span_attributes
async def custom_llm_call():
with trace_llm_call(
model="openai:gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}],
custom_attribute="custom_value"
) as span:
# Your custom logic
response = await some_llm_call()
# Add custom attributes
set_span_attributes(
span,
custom_metric=42,
user_id="user-123"
)
return response
Example Code
See examples/opentelemetry_tracing.py for comprehensive examples:
python examples/opentelemetry_tracing.py
This example demonstrates:
- Simple completions with tracing
- Tool execution tracing
- Structured output tracing
- Multi-turn conversation tracing
- Error handling with tracing
MLflow Metrics Tracking
In addition to OpenTelemetry tracing, GlueLLM automatically logs metrics to MLflow for each LLM call. Metrics are logged automatically when tracing is enabled:
Automatically Tracked Metrics:
- llm.latency_seconds - Time taken for each LLM call
- llm.tokens.prompt - Prompt tokens used
- llm.tokens.completion - Completion tokens used
- llm.tokens.total - Total tokens used
- llm.success - 1.0 for successful calls
- llm.error - 1.0 for failed calls
Tracked Parameters:
- llm.provider - Provider name (openai, anthropic, etc.)
- llm.model - Model name
- llm.finish_reason - Completion finish reason
- llm.error_type - Error type if the call failed
- llm.has_tool_calls - Whether the response included tool calls
Automatic Run Management:
- If no MLflow run is active, GlueLLM automatically creates a default run named "gluellm_auto_metrics"
- All metrics are logged to the active run
- Metrics accumulate over time, allowing aggregation and analysis in MLflow UI
Manual Run Management:
For more control over run grouping, use the mlflow_run_context:
from gluellm.telemetry import mlflow_run_context, configure_tracing
from gluellm.api import complete
configure_tracing()
# Group multiple calls in a single run
with mlflow_run_context("my_workflow", tags={"experiment": "v1"}):
result1 = await complete("First query")
result2 = await complete("Second query")
# Both calls logged to the same run
Viewing Metrics in MLflow:
- Open the MLflow UI: http://localhost:5000
- Navigate to your experiment
- Click on a run to see:
- Metrics over time (latency, token usage)
- Parameters (model, provider, etc.)
- Aggregated statistics
- Comparison across runs
Benefits
Development:
- Debug complex workflows visually
- Identify performance bottlenecks
- Track token usage and optimize costs
- Understand tool execution patterns
- Monitor metrics automatically without manual instrumentation
Production:
- Monitor system health in real-time
- Track LLM API reliability
- Measure end-to-end latency
- Correlate errors with specific calls
- Automatic metrics collection from any-llm clients
Analysis:
- Compare performance across models
- Analyze token usage patterns
- Optimize system prompts
- Identify frequently used tools
- Aggregate metrics over time for trend analysis
Disabling Tracing
To disable tracing without changing code:
export GLUELLM_ENABLE_TRACING=false
Tracing has minimal overhead when disabled - the code checks a flag before instrumenting.
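The guard itself is the standard no-op pattern, something along these lines (hypothetical; GlueLLM's telemetry module may structure it differently):

import contextlib

# Hypothetical guard: return a no-op context manager when tracing is disabled,
# so call sites can always use `with` without checking the flag themselves
def span_or_noop(tracer, name: str, enabled: bool):
    if not enabled:
        return contextlib.nullcontext()
    return tracer.start_as_current_span(name)  # OpenTelemetry tracer API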
Logging
GlueLLM includes production-grade logging with colored console output, file logging with rotation, and optional JSON structured logging for log aggregation tools.
Features
- Colored Console Logs - Easy-to-read colored output for development (using colorlog)
- File Logging - Comprehensive DEBUG-level logs written to files for troubleshooting
- Automatic Rotation - Log files rotate by size (default: 10MB, keep 5 backups)
- JSON Format Support - Optional JSON structured logging for production monitoring
- Environment Configuration - Configure via environment variables or settings
Default Configuration
Logging is automatically configured when you import GlueLLM:
- Console: INFO level with colored output
- File: DEBUG level in logs/gluellm.log
- Rotation: 10MB per file, 5 backup files
- Format: Human-readable text (JSON optional)
Configuration
Configure logging via environment variables:
# Console log level
export GLUELLM_LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
# File log level (typically DEBUG for full details)
export GLUELLM_LOG_FILE_LEVEL=DEBUG
# Log file location
export GLUELLM_LOG_DIR=logs # Directory for log files
export GLUELLM_LOG_FILE_NAME=gluellm.log # Log file name
# JSON structured logging (for log aggregation tools)
export GLUELLM_LOG_JSON_FORMAT=true # Enable JSON format
# Log rotation
export GLUELLM_LOG_MAX_BYTES=10485760 # Max file size (10MB)
export GLUELLM_LOG_BACKUP_COUNT=5 # Number of backup files
Or configure programmatically:
from gluellm.config import settings
from gluellm.logging_config import setup_logging
# Configure logging
setup_logging(
log_level="INFO",
log_file_level="DEBUG",
log_dir="logs",
log_file_name="gluellm.log",
log_json_format=False,
log_max_bytes=10 * 1024 * 1024, # 10MB
log_backup_count=5,
force=True, # Force reconfiguration
)
Log Levels
- DEBUG: Detailed information for diagnosing problems (file only by default)
- INFO: General information about program execution (console default)
- WARNING: Something unexpected but handled gracefully
- ERROR: Serious problems that may allow the program to continue
- CRITICAL: Critical errors that may cause the program to abort
Log File Location
By default, log files are written to the logs/ directory in your project root:
logs/
├── gluellm.log # Current log file
├── gluellm.log.1 # Rotated backup 1
├── gluellm.log.2 # Rotated backup 2
└── ...
JSON Structured Logging
Enable JSON format for integration with log aggregation tools (ELK stack, Datadog, etc.):
export GLUELLM_LOG_JSON_FORMAT=true
JSON logs include structured fields for easy parsing:
{
"asctime": "2024-01-15 10:30:45",
"name": "gluellm.api",
"levelname": "INFO",
"message": "LLM call completed",
"model": "openai:gpt-4o-mini",
"latency": 1.234,
"tokens": {"total": 150, "prompt": 100, "completion": 50}
}
Logging Utilities
GlueLLM provides utilities for enhanced logging:
from gluellm.logging_utils import (
log_function_call,
log_async_function_call,
log_timing,
log_operation,
get_logger,
)
# Automatic function logging decorator
@log_async_function_call(log_args=True, log_result=True)
async def my_function(x: int, y: int) -> int:
return x + y
# Timing context manager
with log_timing("database_query"):
result = await db.query(...)
# Operation context manager
with log_operation("file_processing"):
process_file(...)
# Get configured logger
logger = get_logger(__name__)
logger.info("Custom log message")
Example
See examples/logging_example.py for a complete logging demonstration:
python examples/logging_example.py
Advanced APIs
Conversation Management
GlueLLM provides a Conversation class for managing conversation history with fine-grained control:
from gluellm.models.conversation import Conversation, Role, Message
# Create a conversation
conv = Conversation()
# Add messages
conv.add_message(Role.USER, "What is Python?")
conv.add_message(Role.ASSISTANT, "Python is a programming language.")
# Access messages
for msg in conv.messages:
print(f"{msg.role.value}: {msg.content}")
# Convert to dictionary format for API calls
messages_dict = conv.messages_dict
# [{"role": "user", "content": "What is Python?"}, ...]
# Access conversation ID
print(conv.id) # UUID string
Message Roles:
- Role.SYSTEM - System/instruction messages
- Role.USER - User input messages
- Role.ASSISTANT - LLM response messages
- Role.TOOL - Tool execution result messages
Direct Message Creation:
from gluellm.models.conversation import Message, Role
import uuid
msg = Message(
id=str(uuid.uuid4()),
role=Role.USER,
content="Hello!"
)
Prompt Templating System
GlueLLM uses Jinja2 templates for system prompt formatting with automatic tool integration:
from gluellm.models.prompt import SystemPrompt
# Create a system prompt
prompt = SystemPrompt(content="You are a helpful assistant.")
# Format with tools (automatically includes tool definitions)
def get_weather(location: str) -> str:
"""Get weather for a location."""
return f"Weather in {location}: 22°C"
formatted = prompt.to_formatted_string(tools=[get_weather])
# Returns XML-formatted prompt with tool definitions
Custom Templates:
The system uses the BASE_SYSTEM_PROMPT template, which wraps instructions and tools in XML. The template automatically:
- Includes system instructions
- Lists available tools with names and docstrings
- Formats output as compact XML
Template Structure:
<system_prompt>
<system_instructions>
Your instructions here
</system_instructions>
<tools>
<tool>
<name>tool_name</name>
<description>Tool docstring</description>
</tool>
</tools>
</system_prompt>
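For intuition, a template of that shape can be rendered with plain Jinja2; this sketch is illustrative and is not GlueLLM's actual BASE_SYSTEM_PROMPT:

from jinja2 import Template

# Illustrative template mirroring the structure above, not GlueLLM's BASE_SYSTEM_PROMPT
TEMPLATE = Template(
    "<system_prompt><system_instructions>{{ instructions }}</system_instructions>"
    "<tools>{% for t in tools %}<tool><name>{{ t.name }}</name>"
    "<description>{{ t.description }}</description></tool>{% endfor %}</tools></system_prompt>"
)

def render_system_prompt(instructions: str, tools: list) -> str:
    # Pass plain dicts so the template stays simple
    tool_info = [{"name": t.__name__, "description": t.__doc__ or ""} for t in tools]
    return TEMPLATE.render(instructions=instructions, tools=tool_info)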
Request Configuration
For advanced use cases, use RequestConfig to manage request parameters and conversation state:
from gluellm.models.config import RequestConfig
from gluellm.models.prompt import SystemPrompt
from gluellm.models.conversation import Role
from pydantic import BaseModel
class OutputFormat(BaseModel):
answer: str
# Create request configuration
config = RequestConfig(
model="openai:gpt-4o-mini",
system_prompt=SystemPrompt(content="You are helpful."),
response_format=OutputFormat, # Optional structured output
tools=[get_weather], # Optional tools
)
# Add messages to conversation
config.add_message_to_conversation(Role.USER, "What's the weather?")
config.add_message_to_conversation(Role.ASSISTANT, "Let me check...")
# Get formatted conversation (includes system message)
messages = config.get_conversation()
# Ready to pass to LLM API
Use Cases:
- Building custom request pipelines
- Managing conversation state outside GlueLLM client
- Integrating with other LLM libraries
- Advanced prompt engineering
Workflow Configuration Models
Workflows use configuration models for fine-grained control:
IterativeConfig - Controls iterative refinement behavior:
from gluellm.models.workflow import IterativeConfig
config = IterativeConfig(
max_iterations=5, # Maximum refinement rounds
min_quality_score=0.8, # Early stopping threshold (0.0-1.0)
convergence_threshold=0.05, # Convergence detection (optional)
quality_evaluator=my_evaluator_func, # Custom quality scorer (optional)
)
def my_evaluator_func(content: str, feedback: dict) -> float:
"""Custom quality evaluator returning 0.0-1.0."""
# Analyze content and feedback
return 0.85 # Quality score
CriticConfig - Defines critic agents in iterative workflows:
from gluellm.models.workflow import CriticConfig
from gluellm.executors import AgentExecutor
critic_config = CriticConfig(
executor=AgentExecutor(my_critic_agent),
specialty="grammar and clarity", # Critic's focus area
goal="Optimize for readability", # Specific optimization goal
weight=1.0, # Feedback weight (default: 1.0)
)
DebateConfig - Controls debate workflow behavior:
from gluellm.workflows.debate import DebateConfig
config = DebateConfig(
max_rounds=5, # Maximum debate rounds
require_consensus=False, # Whether consensus is required
judge_decides=True, # Whether judge makes final decision
)
Generic Agent
GlueLLM provides a GenericAgent as a starting point for custom agents:
from gluellm.agents.generic import GenericAgent
# Create a generic agent with default configuration
agent = GenericAgent()
# Access agent properties
print(agent.name) # "Generic Agent"
print(agent.description) # "A generic agent that can use any tool"
# Use in workflows
from gluellm.executors import AgentExecutor
executor = AgentExecutor(agent)
Customization:
You can subclass GenericAgent or use Agent directly for more control:
from gluellm.models.agent import Agent
from gluellm.models.prompt import SystemPrompt
custom_agent = Agent(
name="My Custom Agent",
description="Does specific tasks",
system_prompt=SystemPrompt(content="Custom instructions"),
tools=[my_tool],
max_tool_iterations=5,
model="anthropic:claude-3-5-sonnet-20241022", # Optional model override
)
Multi-Agent Workflows
GlueLLM provides 16 powerful workflow patterns for orchestrating multiple agents, from simple single-agent refinement to complex multi-agent coordination.
Iterative Refinement Workflow
Producer agent creates content, critic agents provide parallel feedback, content is refined iteratively.
from gluellm.workflows.iterative import IterativeRefinementWorkflow
from gluellm.models.workflow import IterativeConfig, CriticConfig
from gluellm.executors import AgentExecutor
from gluellm.models.agent import Agent
from gluellm.models.prompt import SystemPrompt
# Create producer (writer)
writer = Agent(
name="Writer",
description="Technical writer",
system_prompt=SystemPrompt(
content="You are a technical writer. Create clear, engaging content."
),
tools=[],
max_tool_iterations=5,
)
# Create critics
grammar_critic = Agent(
name="Grammar Critic",
description="Reviews grammar and clarity",
system_prompt=SystemPrompt(
content="You review grammar, spelling, and readability. Provide specific feedback."
),
tools=[],
max_tool_iterations=5,
)
style_critic = Agent(
name="Style Critic",
description="Reviews writing style",
system_prompt=SystemPrompt(
content="You review writing style, tone, and engagement. Provide specific feedback."
),
tools=[],
max_tool_iterations=5,
)
# Create workflow with multiple critics (execute in parallel)
workflow = IterativeRefinementWorkflow(
producer=AgentExecutor(writer),
critics=[
CriticConfig(
executor=AgentExecutor(grammar_critic),
specialty="grammar and clarity",
goal="Optimize for readability and eliminate errors",
),
CriticConfig(
executor=AgentExecutor(style_critic),
specialty="writing style",
goal="Ensure engaging, well-structured narrative",
),
],
config=IterativeConfig(max_iterations=3),
)
# Execute workflow
result = await workflow.execute("Write a short article about Python async programming")
print(f"Final output after {result.iterations} iterations:")
print(result.final_output)
print(f"\nConverged: {result.metadata.get('converged', False)}")
print(f"Total interactions: {len(result.agent_interactions)}")
Key Features:
- Multiple critics execute in parallel for efficiency
- Each critic has a specialty and goal
- Iterates until convergence or max iterations
- Optional quality evaluator for early stopping
Pipeline Workflow
Agents execute sequentially, output of one becomes input to the next.
from gluellm.workflows.pipeline import PipelineWorkflow
from gluellm.executors import AgentExecutor
from gluellm.models.agent import Agent
from gluellm.models.prompt import SystemPrompt
# Create pipeline stages
researcher = Agent(
name="Researcher",
description="Researches topics",
system_prompt=SystemPrompt(
content="You gather and organize information. Provide comprehensive research."
),
tools=[],
)
writer = Agent(
name="Writer",
description="Writes content",
system_prompt=SystemPrompt(
content="You write clear, engaging content based on research provided."
),
tools=[],
)
editor = Agent(
name="Editor",
description="Edits content",
system_prompt=SystemPrompt(
content="You review and polish written content to improve clarity and quality."
),
tools=[],
)
# Create pipeline
workflow = PipelineWorkflow(
stages=[
("research", AgentExecutor(researcher)),
("write", AgentExecutor(writer)),
("edit", AgentExecutor(editor)),
]
)
# Execute pipeline
result = await workflow.execute("Topic: The benefits of async programming in Python")
print(f"Completed {result.iterations} stages")
print(f"Final output:\n{result.final_output}")
# View stage outputs
for interaction in result.agent_interactions:
print(f"\nStage: {interaction['stage']}")
print(f"Output: {interaction['output'][:100]}...")
Key Features:
- Sequential execution (stage N gets output from stage N-1)
- Named stages for tracking
- Simple and predictable flow
Debate Workflow
Multiple agents debate a topic, optional judge makes final decision.
from gluellm.workflows.debate import DebateWorkflow, DebateConfig
from gluellm.executors import AgentExecutor
from gluellm.models.agent import Agent
from gluellm.models.prompt import SystemPrompt
# Create debaters
pro_agent = Agent(
name="Proponent",
description="Argues in favor",
system_prompt=SystemPrompt(
content="You argue in favor of positions with strong, logical arguments."
),
tools=[],
)
con_agent = Agent(
name="Opponent",
description="Argues against",
system_prompt=SystemPrompt(
content="You argue against positions with strong counter-arguments."
),
tools=[],
)
judge_agent = Agent(
name="Judge",
description="Evaluates arguments",
system_prompt=SystemPrompt(
content="You are an impartial judge. Evaluate arguments and make fair decisions."
),
tools=[],
)
# Create debate workflow
workflow = DebateWorkflow(
participants=[
("Pro", AgentExecutor(pro_agent)),
("Con", AgentExecutor(con_agent)),
],
judge=AgentExecutor(judge_agent),
config=DebateConfig(
max_rounds=3,
judge_decides=True,
),
)
# Execute debate
result = await workflow.execute("Should AI development be regulated by governments?")
print(f"Debate completed in {result.iterations} rounds")
print(f"Judge decision:\n{result.final_output}")
print(f"\nTotal interactions: {len(result.agent_interactions)}")
Key Features:
- Multiple participants in round-robin debate
- Each participant sees previous arguments
- Optional judge for final decision
- Debate history tracking
Reflection Workflow
Single agent critiques and improves its own output through self-reflection.
from gluellm.workflows.reflection import ReflectionWorkflow, ReflectionConfig
from gluellm.executors import AgentExecutor
workflow = ReflectionWorkflow(
generator=AgentExecutor(generator_agent),
reflector=AgentExecutor(reflector_agent), # Optional, defaults to generator
config=ReflectionConfig(max_reflections=3),
)
result = await workflow.execute("Write an article about Python")
Key Features:
- Self-critique and improvement loop
- Optional improvement threshold for early stopping
- Custom reflection prompt templates
Chain of Density Workflow
Iteratively increases content density by adding entities, details, or examples.
from gluellm.workflows.chain_of_density import ChainOfDensityWorkflow, ChainOfDensityConfig
from gluellm.executors import AgentExecutor
workflow = ChainOfDensityWorkflow(
generator=AgentExecutor(generator_agent),
config=ChainOfDensityConfig(
num_iterations=5,
density_increment="entities", # or "details" or "examples"
preserve_length=True,
),
)
result = await workflow.execute("Summarize this article")
Key Features:
- Progressive density increase
- Optional length preservation
- Multiple increment strategies
Socratic Workflow
Two agents engage in question-answer dialogue to explore topics deeply.
from gluellm.workflows.socratic import SocraticWorkflow, SocraticConfig
from gluellm.executors import AgentExecutor
workflow = SocraticWorkflow(
questioner=AgentExecutor(questioner_agent),
responder=AgentExecutor(responder_agent),
config=SocraticConfig(
max_exchanges=5,
mode="peer", # or "teacher_student"
synthesis_at_end=True,
),
)
result = await workflow.execute("What is artificial intelligence?")
Key Features:
- Teacher-student or peer-to-peer modes
- Optional synthesis at end
- Deep exploration through questioning
RAG Workflow
Retrieval-Augmented Generation with optional fact verification.
from gluellm.workflows.rag import RAGWorkflow, RAGConfig
from gluellm.executors import AgentExecutor
def my_retriever(query: str) -> list[dict]:
# Your retrieval logic
return [
{"content": "Relevant context...", "source": "doc1"},
{"content": "More context...", "source": "doc2"},
]
workflow = RAGWorkflow(
retriever=my_retriever,
generator=AgentExecutor(generator_agent),
verifier=AgentExecutor(verifier_agent), # Optional
config=RAGConfig(
max_retrieved_chunks=5,
include_sources=True,
verify_facts=True,
),
)
result = await workflow.execute("What is Python?")
Key Features:
- Custom retrieval function
- Optional fact verification
- Source attribution
- Fallback on no context
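The my_retriever stub above returns fixed results. As a minimal sketch of a real retriever (assuming only the contract shown above: a callable that takes a query string and returns a list of dicts with content and source keys), a toy keyword scorer might look like this:
```python
# Toy keyword-scoring retriever -- a sketch, not part of GlueLLM itself.
# It honors the same contract as my_retriever above: query in,
# list of {"content": ..., "source": ...} dicts out.
DOCUMENTS = [
    {"content": "Python is a high-level programming language.", "source": "intro.md"},
    {"content": "Async programming in Python is built on asyncio.", "source": "async.md"},
]

def keyword_retriever(query: str, top_k: int = 5) -> list[dict]:
    words = set(query.lower().split())
    # Score each document by how many query words appear in it.
    scored = [(sum(w in doc["content"].lower() for w in words), doc) for doc in DOCUMENTS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Returning an empty list triggers the workflow's no-context fallback.
    return [doc for score, doc in scored if score > 0][:top_k]
```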
Round-Robin Workflow
Multiple agents take turns contributing collaboratively.
from gluellm.workflows.round_robin import RoundRobinWorkflow, RoundRobinConfig
from gluellm.executors import AgentExecutor
workflow = RoundRobinWorkflow(
agents=[
("Writer", AgentExecutor(writer_agent)),
("Editor", AgentExecutor(editor_agent)),
("Reviewer", AgentExecutor(reviewer_agent)),
],
config=RoundRobinConfig(
max_rounds=3,
contribution_style="extend", # or "refine" or "challenge"
final_synthesis=True,
),
)
result = await workflow.execute("Write an article about AI")
Key Features:
- Sequential contribution rounds
- Multiple contribution styles
- Optional final synthesis
Consensus Workflow
Multiple agents propose solutions and iterate until consensus.
from gluellm.workflows.consensus import ConsensusWorkflow, ConsensusConfig
from gluellm.executors import AgentExecutor
workflow = ConsensusWorkflow(
proposers=[
("Agent1", AgentExecutor(agent1)),
("Agent2", AgentExecutor(agent2)),
("Agent3", AgentExecutor(agent3)),
],
config=ConsensusConfig(
min_agreement_ratio=0.7,
max_rounds=5,
voting_strategy="majority", # or "unanimous" or "weighted"
),
)
result = await workflow.execute("Design a solution for X")
Key Features:
- Proposal and voting mechanism
- Configurable agreement threshold
- Multiple voting strategies
Hierarchical Workflow
A manager breaks the task into subtasks, workers execute them, and the manager synthesizes the results.
from gluellm.workflows.hierarchical import HierarchicalWorkflow, HierarchicalConfig
from gluellm.executors import AgentExecutor
workflow = HierarchicalWorkflow(
manager=AgentExecutor(manager_agent),
workers=[
("Worker1", AgentExecutor(worker1)),
("Worker2", AgentExecutor(worker2)),
],
config=HierarchicalConfig(
max_subtasks=5,
parallel_workers=True,
synthesis_strategy="summarize", # or "concatenate" or "merge"
),
)
result = await workflow.execute("Research and write a report on AI")
Key Features:
- Task decomposition
- Parallel or sequential worker execution
- Multiple synthesis strategies
MapReduce Workflow
A parallel map phase processes chunks; a reduce phase aggregates the results.
from gluellm.workflows.map_reduce import MapReduceWorkflow, MapReduceConfig
from gluellm.executors import AgentExecutor
workflow = MapReduceWorkflow(
mapper=AgentExecutor(mapper_agent),
reducer=AgentExecutor(reducer_agent), # Optional, defaults to mapper
config=MapReduceConfig(
chunk_size=1000,
chunk_overlap=100,
max_parallel_chunks=5,
reduce_strategy="summarize", # or "concatenate" or "hierarchical"
),
)
result = await workflow.execute("Process this long document...")
Key Features:
- Automatic chunking
- Parallel processing
- Multiple reduce strategies
ReAct Workflow
Reasoning + Acting pattern with interleaved thought/action steps.
from gluellm.workflows.react import ReActWorkflow, ReActConfig
from gluellm.executors import AgentExecutor
workflow = ReActWorkflow(
reasoner=AgentExecutor(reasoner_agent),
config=ReActConfig(
max_steps=10,
stop_on_final_answer=True,
),
)
result = await workflow.execute("What is the weather in Paris?")
Key Features:
- Interleaved reasoning and action
- Tool-using pattern
- Automatic final answer detection
Mixture of Experts Workflow
Routes queries to specialized experts and combines outputs.
from gluellm.workflows.mixture_of_experts import (
MixtureOfExpertsWorkflow,
ExpertConfig,
MoEConfig,
)
from gluellm.executors import AgentExecutor
workflow = MixtureOfExpertsWorkflow(
experts=[
ExpertConfig(
executor=AgentExecutor(math_expert),
specialty="mathematics",
description="Expert in math and calculations",
activation_keywords=["calculate", "math", "equation"],
),
ExpertConfig(
executor=AgentExecutor(code_expert),
specialty="programming",
description="Expert in coding",
activation_keywords=["code", "program"],
),
],
config=MoEConfig(
routing_strategy="keyword", # or "semantic", "all", "top_k"
top_k=2,
combine_strategy="synthesize", # or "concatenate" or "vote"
),
)
result = await workflow.execute("Calculate the factorial of 10")
Key Features:
- Multiple routing strategies
- Specialized expert agents
- Flexible combination methods
Constitutional Workflow
Generates content, checks it against a set of principles, and revises until they are satisfied.
from gluellm.workflows.constitutional import (
ConstitutionalWorkflow,
ConstitutionalConfig,
Principle,
)
from gluellm.executors import AgentExecutor
workflow = ConstitutionalWorkflow(
generator=AgentExecutor(generator_agent),
critic=AgentExecutor(critic_agent), # Optional, defaults to generator
config=ConstitutionalConfig(
principles=[
Principle(
name="harmless",
description="Content should not cause harm",
severity="critical",
),
Principle(
name="helpful",
description="Content should be helpful",
severity="error",
),
],
max_revisions=3,
require_all_pass=True,
),
)
result = await workflow.execute("Write a response about AI safety")
Key Features:
- Principle-based critique
- Severity levels
- Automatic revision loop
Tree of Thoughts Workflow
Explores multiple reasoning paths, evaluates them, and selects the best.
from gluellm.workflows.tree_of_thoughts import (
TreeOfThoughtsWorkflow,
TreeOfThoughtsConfig,
)
from gluellm.executors import AgentExecutor
workflow = TreeOfThoughtsWorkflow(
thinker=AgentExecutor(thinker_agent),
evaluator=AgentExecutor(evaluator_agent), # Optional, defaults to thinker
config=TreeOfThoughtsConfig(
branching_factor=3,
max_depth=3,
evaluation_strategy="score", # or "vote" or "best_first"
prune_threshold=0.3,
),
)
result = await workflow.execute("Solve this complex problem...")
Key Features:
- Multiple reasoning paths
- Evaluation and pruning
- Best-first exploration
Agents and Executors
Agent Model
An Agent represents a configured LLM with specific capabilities:
from gluellm.models.agent import Agent
from gluellm.models.prompt import SystemPrompt
agent = Agent(
name="Research Assistant",
description="Helps with research tasks",
system_prompt=SystemPrompt(content="You are a research assistant."),
tools=[search_web, summarize_text],
max_tool_iterations=10,
model="openai:gpt-4o-mini", # Optional, defaults to config
)
Executors
Executors run agents or direct LLM queries:
SimpleExecutor - Direct LLM execution:
from gluellm.executors import SimpleExecutor
executor = SimpleExecutor(
model="openai:gpt-4o-mini",
system_prompt="You are helpful.",
tools=[calculator],
max_tool_iterations=5,
)
response = await executor.execute("What is 25 * 4?")
print(response)
AgentExecutor - Execute using a pre-configured Agent:
from gluellm.executors import AgentExecutor
from gluellm.models.agent import Agent
from gluellm.models.prompt import SystemPrompt
agent = Agent(
name="Assistant",
description="Helpful agent",
system_prompt=SystemPrompt(content="You are helpful."),
tools=[],
)
executor = AgentExecutor(agent=agent)
response = await executor.execute("Hello!")
print(response)
Use Cases:
- SimpleExecutor: Quick, one-off tasks with minimal setup
- AgentExecutor: Workflows with pre-configured, reusable agents
Error Handling
GlueLLM provides comprehensive error handling with automatic classification and retry logic.
Exception Hierarchy
from gluellm.api import (
LLMError, # Base exception
TokenLimitError, # Token/context length exceeded
RateLimitError, # Rate limit hit (429)
APIConnectionError, # Network/connection issues (5xx)
InvalidRequestError, # Bad request parameters (400)
AuthenticationError, # Authentication failed (401/403)
)
Handling Errors
from gluellm.api import complete, TokenLimitError, RateLimitError
try:
result = await complete(
user_message="Very long query...",
tools=[my_tool],
)
except TokenLimitError as e:
print(f"Token limit exceeded: {e}")
# Handle by reducing input size
except RateLimitError as e:
print(f"Rate limit hit after retries: {e}")
# Handle by backing off further
except AuthenticationError as e:
print(f"Auth failed: {e}")
# Check API keys
except APIConnectionError as e:
print(f"Connection error after retries: {e}")
# Check network/service status
except InvalidRequestError as e:
print(f"Invalid request: {e}")
# Check parameters
Automatic Retry
GlueLLM automatically retries transient errors with exponential backoff:
Retryable Errors:
- RateLimitError - Rate limit hit (429)
- APIConnectionError - Connection issues (5xx)
Non-Retryable Errors:
- TokenLimitError - Need to reduce input
- AuthenticationError - Need to fix credentials
- InvalidRequestError - Need to fix parameters
Retry Configuration:
# Via environment variables
export GLUELLM_RETRY_MAX_ATTEMPTS=3 # Max retry attempts
export GLUELLM_RETRY_MIN_WAIT=2 # Min wait between retries (seconds)
export GLUELLM_RETRY_MAX_WAIT=30 # Max wait between retries (seconds)
export GLUELLM_RETRY_MULTIPLIER=1 # Exponential backoff multiplier
The retry logic uses exponential backoff: if a request fails with a retryable error, GlueLLM waits before retrying, with wait times increasing exponentially up to the max wait time.
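As a rough illustration (a sketch assuming tenacity-style wait_exponential semantics, which the GLUELLM_RETRY_* settings above appear to mirror; this is not GlueLLM's actual implementation), the wait before each retry grows like this:
```python
# Illustration only: approximate backoff schedule for the defaults above
# (multiplier=1, min=2s, max=30s).
def approx_wait(attempt: int, multiplier: float = 1, min_wait: float = 2, max_wait: float = 30) -> float:
    return max(min_wait, min(max_wait, multiplier * 2**attempt))

for attempt in range(1, 6):
    print(f"retry {attempt}: wait ~{approx_wait(attempt):g}s")
# retry 1: ~2s, retry 2: ~4s, retry 3: ~8s, retry 4: ~16s, retry 5: ~30s (capped)
```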
CLI Reference
GlueLLM includes a comprehensive CLI for testing and demonstrations.
Available Commands
After installation, use the gluellm command:
Basic Test Commands
# Test basic completion
gluellm test-completion
# Test tool calling
gluellm test-tool-call
# Test streaming responses
gluellm test-streaming
gluellm test-streaming -m "Tell me a joke"
# Test structured output with Pydantic
gluellm test-structured-output
gluellm test-structured-output -d "Paris, France has 2 million people"
# Test multi-turn conversations
gluellm test-multi-turn-conversation
gluellm test-multi-turn-conversation -t 5
# Run all interactive demos
gluellm demo
# Run example scripts
gluellm examples
Advanced Feature Tests
# Test batch processing
gluellm test-batch-processing
gluellm test-batch-processing -c 10 -n 5
# Test hooks system
gluellm test-hooks
# Test correlation ID tracking
gluellm test-correlation-ids
# Test API key pool
gluellm test-api-key-pool
gluellm test-api-key-pool -k 3
# Test OpenTelemetry integration
gluellm test-telemetry
Performance & Reliability Tests
# Test error handling
gluellm test-error-handling
# Test rate limiting
gluellm test-rate-limiting
gluellm test-rate-limiting -r 10
# Test timeout handling
gluellm test-timeout
gluellm test-timeout -t 10.0
# Test tool registration without execution
gluellm test-tool-without-execution
# Test different models/providers
gluellm test-different-models
gluellm test-different-models -m openai:gpt-4o -m anthropic:claude-3-sonnet
Test Suite Commands
# Run test suite
gluellm run-tests
gluellm run-tests --verbose
gluellm run-tests -c TestBasicToolCalling
gluellm run-tests -t test_single_tool_call
gluellm run-tests --no-integration
Workflow Test Commands
# Test iterative refinement workflow
gluellm test-iterative-workflow
gluellm test-iterative-workflow -i "Write about AI" -n 3 -c 2
# Test pipeline workflow
gluellm test-pipeline-workflow
gluellm test-pipeline-workflow -i "Topic: Python" -s 3
# Test debate workflow
gluellm test-debate-workflow
gluellm test-debate-workflow -t "Is remote work better?" -r 3 --no-judge
# Test reflection workflow
gluellm test-reflection-workflow
gluellm test-reflection-workflow -i "Write about AI" -r 3
# Test chain of density workflow
gluellm test-chain-of-density-workflow
# Test Socratic workflow
gluellm test-socratic-workflow
gluellm test-socratic-workflow -t "What is Python?" -e 5
# Test RAG workflow
gluellm test-rag-workflow
# Test round-robin workflow
gluellm test-round-robin-workflow
# Test consensus workflow
gluellm test-consensus-workflow
# Test hierarchical workflow
gluellm test-hierarchical-workflow
# Test MapReduce workflow
gluellm test-map-reduce-workflow
# Test ReAct workflow
gluellm test-react-workflow
# Test Mixture of Experts workflow
gluellm test-mixture-of-experts-workflow
# Test Constitutional workflow
gluellm test-constitutional-workflow
# Test Tree of Thoughts workflow
gluellm test-tree-of-thoughts-workflow
Legacy Usage: If the gluellm command is not available, you can use python -m gluellm.cli or python source/cli.py (deprecated).
Architecture
graph TB
subgraph UserAPI [User-Facing API]
complete[complete function]
structured[structured_complete function]
GlueLLM[GlueLLM client]
end
subgraph Core [Core Components]
API[api.py - LLM calls]
Config[config.py - Settings]
Conversation[conversation.py - History]
Prompt[prompt.py - Templates]
end
subgraph Agents [Agent System]
Agent[Agent model]
Executor[Executor base]
SimpleExec[SimpleExecutor]
AgentExec[AgentExecutor]
end
subgraph Workflows [Multi-Agent Workflows]
Iterative[IterativeRefinementWorkflow]
Pipeline[PipelineWorkflow]
Debate[DebateWorkflow]
end
subgraph Backend [Backend]
AnyLLM[any-llm-sdk]
Providers[OpenAI Anthropic XAI etc]
end
complete --> API
structured --> API
GlueLLM --> API
API --> Config
API --> Conversation
API --> Prompt
SimpleExec --> API
AgentExec --> Agent
AgentExec --> API
Iterative --> Executor
Pipeline --> Executor
Debate --> Executor
API --> AnyLLM
AnyLLM --> Providers
Tool Execution Flow
sequenceDiagram
participant User
participant GlueLLM
participant LLM
participant Tools
User->>GlueLLM: complete(message, tools)
loop Tool Execution Loop (max N iterations)
GlueLLM->>LLM: Send messages
alt LLM wants to use tools
LLM->>GlueLLM: Tool call requests
loop For each tool call
GlueLLM->>Tools: Execute tool
Tools->>GlueLLM: Tool result
end
GlueLLM->>GlueLLM: Add tool results to messages
else LLM provides final answer
LLM->>GlueLLM: Final response
GlueLLM->>User: ToolExecutionResult
end
end
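From the caller's side, that whole loop is a single await. A minimal sketch (get_weather is a hypothetical tool; the final_response attribute follows the ToolExecutionResult shown in the diagram):
```python
import asyncio
from gluellm.api import complete

def get_weather(city: str) -> str:
    """Hypothetical tool: return canned weather data for a city."""
    return f"Sunny and 22°C in {city}"

async def main():
    # complete() drives the loop in the diagram: the model requests the
    # tool, GlueLLM executes it, feeds the result back, and returns the
    # final answer once the model stops calling tools.
    result = await complete(
        user_message="What is the weather in Paris?",
        tools=[get_weather],
    )
    print(result.final_response)

asyncio.run(main())
```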
Deployment & Production
Environment Setup
For production deployments, use environment variables or a secure configuration management system:
# Production environment variables
export GLUELLM_DEFAULT_MODEL=openai:gpt-4o-mini
export GLUELLM_MAX_TOOL_ITERATIONS=10
export GLUELLM_RETRY_MAX_ATTEMPTS=3
export GLUELLM_LOG_LEVEL=INFO
# API Keys (use secrets management in production)
export OPENAI_API_KEY=sk-...
# or
export GLUELLM_OPENAI_API_KEY=sk-...
Docker Deployment
Create a Dockerfile:
FROM python:3.12-slim
WORKDIR /app
# Install uv for dependency management
RUN pip install uv
# Copy dependency files
COPY pyproject.toml uv.lock ./
# Install dependencies
RUN uv pip install --system -e .
# Copy application code
COPY gluellm/ ./gluellm/
COPY examples/ ./examples/
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV GLUELLM_LOG_LEVEL=INFO
# Run your application
CMD ["python", "your_app.py"]
Build and run:
docker build -t gluellm-app .
docker run -e OPENAI_API_KEY=sk-... gluellm-app
Docker Compose
Example docker-compose.yml:
version: '3.8'
services:
gluellm-app:
build: .
environment:
- GLUELLM_DEFAULT_MODEL=openai:gpt-4o-mini
- GLUELLM_LOG_LEVEL=INFO
- OPENAI_API_KEY=${OPENAI_API_KEY}
volumes:
- ./logs:/app/logs
restart: unless-stopped
Production Considerations
1. API Key Management
- Use secrets management (AWS Secrets Manager, HashiCorp Vault, etc.)
- Never commit API keys to version control
- Rotate keys regularly
- Use least-privilege access
2. Error Handling
- Implement comprehensive error handling
- Set up monitoring and alerting
- Log errors for debugging
- Provide fallback responses for users
3. Rate Limiting
- Implement rate limiting for your application
- Monitor API usage and costs
- Use retry configuration appropriately
- Consider caching responses when possible
4. Logging
- Configure structured logging
- Log to files or centralized logging service
- Include request IDs for tracing
- Don't log sensitive data (API keys, PII)
5. Performance
- Use connection pooling for external APIs
- Implement caching for repeated queries
- Monitor token usage and optimize prompts
- Consider async/await for concurrent operations (see the sketch after this list)
6. Security
- Validate all inputs
- Sanitize tool outputs
- Use HTTPS for all API calls
- Implement authentication/authorization
- Regular security updates
7. Scaling
- Use load balancers for multiple instances
- Consider message queues for async processing
- Monitor resource usage
- Scale horizontally as needed
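On the performance point above, here is a sketch of fanning out independent requests concurrently while capping in-flight calls (assuming only the complete() API and the final_response attribute used elsewhere in this README):
```python
import asyncio
from gluellm.api import complete

# Cap concurrency so a burst of requests doesn't immediately trip
# provider rate limits; tune the bound to your quota.
_semaphore = asyncio.Semaphore(5)

async def _bounded_complete(question: str):
    async with _semaphore:
        return await complete(user_message=question)

async def answer_all(questions: list[str]) -> list[str]:
    # Requests run concurrently up to the semaphore bound.
    results = await asyncio.gather(*(_bounded_complete(q) for q in questions))
    return [r.final_response for r in results]
```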
Example Production Application
"""Production-ready GlueLLM application."""
import asyncio
import logging
import os
from gluellm.api import GlueLLM, RateLimitError, TokenLimitError
# Configure logging
logging.basicConfig(
level=os.getenv("GLUELLM_LOG_LEVEL", "INFO"),
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
class ProductionLLMService:
"""Production LLM service with error handling and monitoring."""
def __init__(self):
self.client = GlueLLM(
model=os.getenv("GLUELLM_DEFAULT_MODEL", "openai:gpt-4o-mini"),
system_prompt=os.getenv("GLUELLM_DEFAULT_SYSTEM_PROMPT", "You are helpful."),
max_tool_iterations=int(os.getenv("GLUELLM_MAX_TOOL_ITERATIONS", "10")),
)
async def process_request(self, user_message: str) -> dict:
"""Process a user request with comprehensive error handling."""
try:
result = await self.client.complete(user_message)
return {
"success": True,
"response": result.final_response,
"tool_calls": result.tool_calls_made,
}
except RateLimitError as e:
logger.warning(f"Rate limit hit: {e}")
return {
"success": False,
"error": "Service temporarily unavailable. Please try again later.",
}
except TokenLimitError as e:
logger.error(f"Token limit exceeded: {e}")
return {
"success": False,
"error": "Request too long. Please shorten your message.",
}
except Exception as e:
logger.error(f"Unexpected error: {e}", exc_info=True)
return {
"success": False,
"error": "An error occurred. Please try again.",
}
# Usage
async def main():
service = ProductionLLMService()
result = await service.process_request("Hello!")
print(result)
if __name__ == "__main__":
asyncio.run(main())
Monitoring & Observability
- Metrics: Track request counts, latency, error rates
- Logging: Structured logs with correlation IDs
- Alerting: Set up alerts for high error rates or rate limits
- Cost Monitoring: Track token usage and API costs
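As a sketch of the metrics point (illustrative only; GlueLLM does not ship this helper), you can wrap the service above to record latency and outcomes, then forward them to your metrics backend:
```python
import logging
import time

logger = logging.getLogger("metrics")

async def timed_request(service: "ProductionLLMService", user_message: str) -> dict:
    """Record latency and success/failure around a service call."""
    start = time.perf_counter()
    result = await service.process_request(user_message)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Swap this log line for Prometheus/StatsD/CloudWatch counters as needed.
    logger.info("llm_request success=%s latency_ms=%.1f", result["success"], elapsed_ms)
    return result
```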
CI/CD Integration
Example GitHub Actions workflow:
name: Deploy GlueLLM App
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.12'
- name: Install dependencies
run: |
pip install uv
uv pip install -e ".[dev]"
- name: Run tests
run: uv run pytest tests/ -m "not integration"
- name: Deploy
run: |
# Your deployment steps
Development
Running Tests
# Run all tests
uv run pytest tests/
# Run specific test file
uv run pytest tests/test_api.py
# Run with verbose output
uv run pytest tests/ -v
# Run specific test
uv run pytest tests/test_api.py::test_simple_completion
# Skip integration tests
uv run pytest tests/ -m "not integration"
# Using the CLI
gluellm run-tests
gluellm run-tests --verbose
Running Examples
# Run all examples
uv run python examples/basic_usage.py
# Run CLI demos
gluellm demo
# Run specific workflow examples
gluellm test-iterative-workflow
gluellm test-pipeline-workflow
gluellm test-debate-workflow
Project Structure
gluellm/
├── gluellm/ # Main package (use this)
│ ├── api.py # Core API (GlueLLM, complete, structured_complete)
│ ├── cli.py # Command-line interface
│ ├── config.py # Configuration management
│ ├── models/
│ │ ├── agent.py # Agent model
│ │ ├── config.py # Request configuration
│ │ ├── conversation.py # Conversation and message models
│ │ ├── prompt.py # Prompt templates
│ │ └── workflow.py # Workflow configurations
│ ├── executors/
│ │ ├── _base.py # Executor interface
│ │ └── __init__.py # SimpleExecutor, AgentExecutor
│ ├── workflows/
│ │ ├── _base.py # Workflow base class
│ │ ├── iterative.py # Iterative refinement workflow
│ │ ├── pipeline.py # Pipeline workflow
│ │ └── debate.py # Debate workflow
│ └── agents/
│ └── generic.py # Generic agent implementation
├── source/ # Legacy/alternate CLI location (deprecated)
│ └── cli.py # Use 'gluellm' command instead
├── examples/
│ ├── basic_usage.py # Usage examples
│ ├── configuration_example.py # Configuration examples
│ ├── error_handling_example.py # Error handling examples
│ └── workflow_example.py # Workflow examples
├── tests/
│ ├── test_api.py # API tests
│ ├── test_cli.py # CLI tests
│ ├── test_workflows.py # Workflow tests
│ ├── test_error_handling.py # Error handling tests
│ └── test_llm_edge_cases.py # LLM integration tests
├── pyproject.toml # Project configuration
├── env.example # Example environment configuration
├── LICENSE # MIT License
├── CONTRIBUTING.md # Contribution guidelines
├── CHANGELOG.md # Version history
├── SECURITY.md # Security policy
└── README.md # This file
Note: The source/ directory contains legacy CLI code. The main package is gluellm/. Use the gluellm CLI command (after installation) instead of python source/cli.py.
Code Quality
# Format code with ruff
uv run ruff format .
# Lint code
uv run ruff check .
# Run pre-commit hooks
uv run pre-commit run --all-files
Requirements
- Python: 3.12+
- Core Dependencies:
- any-llm-sdk[openai]>=1.5.0 - Multi-provider LLM support
- pydantic>=2.12.5 - Data validation and structured output
- pydantic-settings>=2.12.0 - Configuration management
- tenacity>=9.0.0 - Retry logic with exponential backoff
- throttled-py>=3.1.0 - High-performance rate limiting
- click>=8.3.1 - CLI framework
- rich>=14.2.0 - Terminal formatting
- jinja2>=3.1.6 - Prompt templating
See pyproject.toml for complete dependency list.
Known Limitations
Streaming with Tools
When using stream_complete() with tools enabled, the final response after tool execution is not streamed token-by-token. Instead, it's returned as a single chunk. This is because we need to check if the model wants to call additional tools before streaming.
True streaming only occurs when:
- No tools are provided (tools=None or an empty list), OR
- Tool execution is disabled (execute_tools=False)
# This will stream token-by-token (no tools)
async for chunk in stream_complete("Tell me a story"):
print(chunk.content, end="", flush=True)
# This will NOT stream the final response (tools enabled)
async for chunk in stream_complete("Get weather", tools=[get_weather]):
# Tool execution happens, then final response comes as one chunk
print(chunk.content, end="", flush=True)
This is a known architectural limitation and may be improved in future versions.
Contributing
- Follow the existing code style (enforced by ruff)
- Add tests for new features
- Update documentation as needed
- Run tests and linters before submitting
- Use conventional commit messages
# Setup development environment
uv pip install -e ".[dev]"
uv run pre-commit install
# Before committing
uv run pytest tests/
uv run ruff check .
uv run ruff format .
License
MIT License - see LICENSE file for details.
Copyright (c) 2025 GlueLLM Contributors
Credits
Built on top of any-llm-sdk for multi-provider LLM support.
Need Help? Check out the examples/ directory for more usage patterns, or run gluellm demo to see interactive demonstrations of all features.