High-performance LLM client with batch processing, caching, and checkpoint recovery
flexllm
One Client, All LLMs
Production-grade LLM client with checkpoint recovery, response caching, and multi-provider support
Design Philosophy
One unified entry point for all LLM providers.
from flexllm import LLMClient
# That's all you need to import. Everything else is configuration.
flexllm follows the "Single Interface, Multiple Backends" principle. Whether you're calling OpenAI, Gemini, Claude, or a self-hosted model, the API stays the same. Provider differences are abstracted away - you focus on your application logic, not on SDK quirks.
# OpenAI GPT-4
client = LLMClient(base_url="https://api.openai.com/v1", model="gpt-4", api_key="...")
# Google Gemini
client = LLMClient(provider="gemini", model="gemini-2.0-flash", api_key="...")
# Anthropic Claude
client = LLMClient(provider="claude", model="claude-sonnet-4-20250514", api_key="...")
# Self-hosted (vLLM, Ollama, etc.)
client = LLMClient(base_url="http://localhost:8000/v1", model="qwen2.5")
# The API is identical for all:
result = await client.chat_completions(messages)
results = await client.chat_completions_batch(messages_list)
Features
| Feature | Description |
|---|---|
| Unified Interface | One LLMClient for OpenAI, Gemini, Claude, and any OpenAI-compatible API |
| Checkpoint Recovery | Batch jobs auto-resume from interruption - process millions of requests safely |
| Response Caching | Built-in caching with TTL and IPC multi-process sharing |
| Cost Tracking | Real-time cost monitoring with budget control |
| High-Performance Async | Fine-grained concurrency control, QPS limiting, and streaming |
| Load Balancing | Multi-endpoint distribution with automatic failover |
Installation
pip install flexllm
# With all features
pip install flexllm[all]
Quick Start
Basic Usage
from flexllm import LLMClient
client = LLMClient(
model="gpt-4",
base_url="https://api.openai.com/v1",
api_key="your-api-key"
)
# Async (run inside a coroutine or an async-aware REPL)
response = await client.chat_completions([
{"role": "user", "content": "Hello!"}
])
# Sync
response = client.chat_completions_sync([
{"role": "user", "content": "Hello!"}
])
Batch Processing with Checkpoint Recovery
Process millions of requests safely. If interrupted, just restart - it continues from where it left off.
messages_list = [
[{"role": "user", "content": f"Question {i}"}]
for i in range(100000)
]
# Interrupted at 50,000? Re-run and it continues from 50,001.
results = await client.chat_completions_batch(
messages_list,
output_jsonl="results.jsonl", # Progress saved here
show_progress=True,
)
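The resume behavior can be pictured with a small standalone sketch. This is not flexllm's implementation: the record layout (one JSON object per line with `index` and `result` fields) and the helper names `load_done` and `run_batch` are assumptions made for illustration.

```python
import json
import os
import tempfile


def load_done(path):
    """Return {index: result} for every valid record already in the JSONL file."""
    done = {}
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore a line that is not valid JSON (e.g. cut off by a crash)
            done[record["index"]] = record["result"]
    return done


def run_batch(items, worker, path):
    """Process items in order, skipping indices already recorded in `path`."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for i, item in enumerate(items):
            if i in done:
                continue  # finished in an earlier run
            done[i] = worker(item)
            f.write(json.dumps({"index": i, "result": done[i]}) + "\n")
            f.flush()  # push each record out so an interruption loses little work
    return [done[i] for i in range(len(items))]


# Usage: the second call only computes the item the first call never saw.
path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
run_batch(["a", "b"], str.upper, path)
print(run_batch(["a", "b", "c"], str.upper, path))  # ['A', 'B', 'C']
```

Writing one record per completed item is what makes a restart cheap: the output file doubles as the checkpoint, so no separate state file is needed.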
Response Caching
from flexllm import LLMClient, ResponseCacheConfig
client = LLMClient(
model="gpt-4",
base_url="https://api.openai.com/v1",
api_key="your-api-key",
cache=ResponseCacheConfig(enabled=True, ttl=3600), # 1 hour TTL
)
# First call: API request (~2s, ~$0.01)
result1 = await client.chat_completions(messages)
# Second call: Cache hit (~0.001s, $0)
result2 = await client.chat_completions(messages)
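As a rough picture of how a TTL response cache can work, here is a minimal in-memory sketch. It is an illustration under assumptions, not flexllm's cache: the class name `TTLCache`, the key scheme (SHA-256 of the model name plus messages), and the injectable clock are all hypothetical, and it does not cover the IPC sharing mentioned above.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory response cache keyed by a hash of the request."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._store = {}

    @staticmethod
    def make_key(model, messages):
        # sort_keys makes the key independent of dict key order
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)
```

A caller would check `get` before sending a request and call `set` with the response afterwards; identical messages for the same model then hit the cache until the TTL elapses.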
Cost Tracking
# Track costs during batch processing
results, cost_report = await client.chat_completions_batch(
messages_list,
return_cost_report=True,
)
print(f"Total cost: ${cost_report.total_cost:.4f}")
# Real-time cost display in progress bar
results = await client.chat_completions_batch(
messages_list,
track_cost=True, # Shows 💰 $0.0012 in progress bar
)
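The arithmetic behind a cost report is simple enough to sketch. The example below is illustrative only: `SimpleCostTracker`, the `PRICES` table, and the model name `example-model` are hypothetical, and the prices are placeholders rather than real rates for any provider.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; not real rates for any provider.
PRICES = {"example-model": {"input": 2.50, "output": 10.00}}


@dataclass
class SimpleCostTracker:
    budget: float
    total_cost: float = 0.0

    def add(self, model, input_tokens, output_tokens):
        """Add one request's cost to the running total and return that cost."""
        price = PRICES[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.total_cost += cost
        return cost

    @property
    def over_budget(self):
        return self.total_cost > self.budget


tracker = SimpleCostTracker(budget=0.01)
tracker.add("example-model", input_tokens=1000, output_tokens=500)
print(f"Total cost: ${tracker.total_cost:.4f}")  # Total cost: $0.0075
```

A batch runner could consult `over_budget` between requests and stop submitting new work once the limit is crossed.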
Streaming
# Token-by-token streaming
async for chunk in client.chat_completions_stream(messages):
print(chunk, end="", flush=True)
# Batch streaming - process results as they complete
async for result in client.iter_chat_completions_batch(messages_list):
process(result)
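Yielding batch results as they complete can be built on `asyncio.as_completed`. The sketch below shows the general pattern, not flexllm's internals; `iter_results` and `fake_request` are hypothetical names, with `fake_request` standing in for a real API call.

```python
import asyncio


async def fake_request(delay, value):
    """Stand-in for an API call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return value


async def iter_results(coros, concurrency_limit=10):
    """Yield (index, result) pairs in completion order, not submission order."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def run(index, coro):
        async with semaphore:  # cap the number of requests in flight
            return index, await coro

    tasks = [asyncio.create_task(run(i, c)) for i, c in enumerate(coros)]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def demo():
    coros = [fake_request(0.05, "slow"), fake_request(0.0, "fast")]
    return [pair async for pair in iter_results(coros)]
```

Returning the original index alongside each result lets the consumer match outputs back to inputs even though they arrive out of order.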
Multi-Provider Support
from flexllm import LLMClient
# OpenAI (auto-detected from base_url)
client = LLMClient(
base_url="https://api.openai.com/v1",
api_key="sk-...",
model="gpt-4o",
)
# Gemini
client = LLMClient(
provider="gemini",
api_key="your-gemini-key",
model="gemini-2.0-flash",
)
# Claude
client = LLMClient(
provider="claude",
api_key="your-anthropic-key",
model="claude-sonnet-4-20250514",
)
# Self-hosted (vLLM, Ollama, etc.)
client = LLMClient(
base_url="http://localhost:8000/v1",
model="qwen2.5",
)
Thinking Mode (Reasoning Models)
A unified interface for DeepSeek-R1, Qwen3, Claude extended thinking, and Gemini thinking.
result = await client.chat_completions(
messages,
thinking=True, # Enable thinking
return_raw=True,
)
# Unified parsing across all providers
parsed = client.parse_thoughts(result.data)
print("Thinking:", parsed["thought"])
print("Answer:", parsed["answer"])
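To make the `thought` / `answer` split concrete: some self-hosted reasoning models return their reasoning inline, wrapped in `<think>...</think>` tags, while hosted APIs generally return it in separate structured fields. The sketch below handles only the inline-tag case. It is an assumption about one possible parsing approach, not the logic of `parse_thoughts`, and `split_thoughts` is a hypothetical helper.

```python
import re

# Non-greedy match so only the first <think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thoughts(text):
    """Split raw model text into {'thought': ..., 'answer': ...}."""
    match = THINK_RE.search(text)
    if match is None:
        return {"thought": "", "answer": text.strip()}
    thought = match.group(1).strip()
    # The answer is everything outside the matched block.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return {"thought": thought, "answer": answer}


parsed = split_thoughts("<think>2 + 2 is 4</think>\nThe answer is 4.")
print("Thinking:", parsed["thought"])  # Thinking: 2 + 2 is 4
print("Answer:", parsed["answer"])     # Answer: The answer is 4.
```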
Load Balancing
from flexllm import LLMClientPool
pool = LLMClientPool(
endpoints=[
{"base_url": "http://gpu1:8000/v1", "model": "qwen"},
{"base_url": "http://gpu2:8000/v1", "model": "qwen"},
],
load_balance="round_robin", # or "weighted", "random", "fallback"
fallback=True, # Auto-switch on failure
)
# Requests automatically distributed
results = await pool.chat_completions_batch(messages_list, distribute=True)
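Round-robin selection with failover can be sketched in a few lines. This is a simplified, synchronous illustration and not how `LLMClientPool` is implemented; `RoundRobinPool`, its `call` method, and the `send` callback are hypothetical.

```python
import itertools


class RoundRobinPool:
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        if not self.endpoints:
            raise ValueError("at least one endpoint is required")
        self._next = itertools.cycle(range(len(self.endpoints)))

    def call(self, send, request):
        """Start at the next round-robin slot; on failure, try the other endpoints."""
        start = next(self._next)
        last_error = None
        for offset in range(len(self.endpoints)):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return send(endpoint, request)
            except ConnectionError as exc:
                last_error = exc  # this endpoint failed; fall through to the next
        raise RuntimeError("all endpoints failed") from last_error
```

Here `send(endpoint, request)` is whatever function performs the actual request. With all endpoints healthy, calls alternate between them; when one raises `ConnectionError`, the same request is retried on the next endpoint before an error is reported.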
CLI
# Quick ask
flexllm ask "What is Python?"
# Interactive chat
flexllm chat
# Batch processing with cost tracking
flexllm batch input.jsonl -o output.jsonl --track-cost
# Model management
flexllm list # Configured models
flexllm models # Remote available models
flexllm test # Test connection
Architecture
flexllm/
├── clients/            # All client implementations
│   ├── base.py         # Abstract base class (LLMClientBase)
│   ├── llm.py          # Unified entry point (LLMClient)
│   ├── openai.py       # OpenAI-compatible backend
│   ├── gemini.py       # Google Gemini backend
│   ├── claude.py       # Anthropic Claude backend
│   ├── pool.py         # Multi-endpoint load balancer
│   └── router.py       # Provider routing strategies
├── pricing/            # Cost estimation and tracking
│   ├── cost_tracker.py
│   └── token_counter.py
├── cache/              # Response caching with IPC
├── async_api/          # High-performance async engine
└── msg_processors/     # Multi-modal message processing
The architecture follows a simple layered design:
LLMClient (Unified entry point - recommended)
    │
    ├── Provider auto-detection or explicit selection
    │
    └── Backend Clients (internal)
        ├── OpenAIClient
        ├── GeminiClient
        └── ClaudeClient
            │
            └── LLMClientBase (Abstract - 4 methods to implement)
                │
                ├── ConcurrentRequester (Async engine)
                ├── ResponseCache (Caching layer)
                └── CostTracker (Cost monitoring)
API Reference
LLMClient
LLMClient(
provider: str = "auto", # "auto", "openai", "gemini", "claude"
model: str, # Model name
base_url: str = None, # API base URL (required for openai)
api_key: str = "EMPTY", # API key
cache: ResponseCacheConfig, # Cache config
concurrency_limit: int = 10, # Max concurrent requests
max_qps: float = None, # Max requests per second
retry_times: int = 3, # Retry count on failure
timeout: int = 120, # Request timeout (seconds)
)
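The `concurrency_limit` and `max_qps` parameters control two different things: how many requests are in flight at once, and how often new requests may start. The sketch below shows one common way to combine the two with `asyncio`. It is an assumption about the general technique, not a description of flexllm's async engine; `RateLimiter` and `run_limited` are hypothetical names.

```python
import asyncio
import time


class RateLimiter:
    """Space request starts at least 1/max_qps seconds apart."""

    def __init__(self, max_qps):
        self.interval = 1.0 / max_qps
        self._next_slot = 0.0

    async def wait(self):
        # No await between reading and updating _next_slot, so this
        # reservation is atomic within a single event loop.
        now = time.monotonic()
        slot = max(now, self._next_slot)  # earliest free start time
        self._next_slot = slot + self.interval
        await asyncio.sleep(slot - now)


async def run_limited(coro_fns, concurrency_limit=10, max_qps=None):
    """Run zero-argument coroutine functions under both limits; keep input order."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    limiter = RateLimiter(max_qps) if max_qps else None

    async def run(fn):
        async with semaphore:  # at most concurrency_limit requests in flight
            if limiter:
                await limiter.wait()  # and at most max_qps starts per second
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))
```

The semaphore protects against too many open connections, while the rate limiter protects against provider-side request quotas; a slow endpoint is usually bounded by the first, a fast one by the second.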
Main Methods
| Method | Description |
|---|---|
| chat_completions(messages) | Single async request |
| chat_completions_sync(messages) | Single sync request |
| chat_completions_batch(messages_list) | Batch async with checkpoint |
| iter_chat_completions_batch(messages_list) | Streaming batch results |
| chat_completions_stream(messages) | Token-by-token streaming |
License
Apache 2.0