flexllm
High-performance LLM client with batch processing, caching, and checkpoint recovery
Features
- Batch Processing: Process thousands of requests concurrently with QPS control
- Response Caching: Built-in caching with TTL support to avoid duplicate API calls
- Checkpoint Recovery: Resume interrupted batch jobs automatically
- Multi-Provider: OpenAI, Gemini, and any OpenAI-compatible API (vLLM, Ollama, DeepSeek, Qwen...)
- Multi-Modal: Image + text processing with automatic base64 encoding
- Load Balancing: Multi-endpoint client pool with failover
- Async-First: Built on asyncio for maximum performance
- CLI Tool: Quick ask, chat, and test commands
Installation
pip install flexllm
# With Gemini support
pip install flexllm[gemini]
# With caching support
pip install flexllm[cache]
# With CLI support
pip install flexllm[cli]
# All features
pip install flexllm[all]
Quick Start
Single Request
from flexllm import LLMClient
client = LLMClient(
model="gpt-4",
base_url="https://api.openai.com/v1",
api_key="your-api-key"
)
# Async (run inside an async function or an async-capable REPL)
response = await client.chat_completions([
{"role": "user", "content": "Hello!"}
])
# Sync
response = client.chat_completions_sync([
{"role": "user", "content": "Hello!"}
])
Batch Processing with Checkpoint Recovery
from flexllm import LLMClient
client = LLMClient(
model="gpt-4",
base_url="https://api.openai.com/v1",
api_key="your-api-key",
concurrency_limit=50,
max_qps=100,
)
messages_list = [
[{"role": "user", "content": "What is 1+1?"}],
[{"role": "user", "content": "What is 2+2?"}],
# ... thousands more
]
# Batch processing with checkpoint recovery
# If interrupted, re-running will resume from where it stopped
results = await client.chat_completions_batch(
messages_list,
output_file="results.jsonl", # Auto-save progress
show_progress=True,
)
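The resume behaviour described above can be pictured with a small standalone sketch. This is not flexllm's implementation: the record layout (`index`, `result`) and the helper names `load_done` and `run_with_checkpoint` are assumptions made for illustration, and the sketch runs synchronously to stay short.

```python
import json
import os


def load_done(path):
    """Return {index: result} for every record already saved in the JSONL file."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["index"]] = record["result"]
    return done


def run_with_checkpoint(items, worker, path):
    """Process items in order, skipping any index already present in the file."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for i, item in enumerate(items):
            if i in done:
                continue  # finished in an earlier run
            result = worker(item)
            # Append and flush one record per item so progress survives a crash.
            f.write(json.dumps({"index": i, "result": result}) + "\n")
            f.flush()
            done[i] = result
    return [done[i] for i in range(len(items))]
```

Calling `run_with_checkpoint` a second time with the same output path only invokes `worker` for items that have no saved record.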
Response Caching
from flexllm import LLMClient, ResponseCacheConfig
# Enable caching (avoid duplicate API calls)
client = LLMClient(
model="gpt-4",
base_url="https://api.openai.com/v1",
api_key="your-api-key",
cache=ResponseCacheConfig(enabled=True, ttl=3600), # 1 hour TTL
)
# Duplicate requests hit the cache automatically
messages = [{"role": "user", "content": "Hello!"}]
result1 = await client.chat_completions(messages) # API call
result2 = await client.chat_completions(messages) # Cache hit, no API call
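To make the cache-hit behaviour concrete, here is a minimal in-memory TTL cache keyed by a hash of the request. It is a sketch of the general technique only: the class `TTLCache`, its `key_for` helper, and the choice of SHA-256 over the model name and messages are assumptions, and flexllm's own cache (which the API reference describes as disk-backed) may work differently.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory cache whose entries expire ttl seconds after being set."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    @staticmethod
    def key_for(model, messages):
        # sort_keys makes the key independent of dict key order.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)
```

A client would compute the key before each request, return the stored value on a hit, and call the API and `set` the result on a miss.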
Streaming Response
async for chunk in client.chat_completions_stream(messages):
    print(chunk, end="", flush=True)
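If the full text is needed after streaming finishes, the chunks can be accumulated while they arrive. The sketch below assumes each chunk is a plain string, as the `print` call above suggests; `fake_stream` is a hypothetical stand-in for a real streaming call so that the example runs without a network connection.

```python
import asyncio


async def fake_stream(text, size=4):
    """Stand-in for a streaming response: yields the text in small chunks."""
    for i in range(0, len(text), size):
        await asyncio.sleep(0)  # give control back to the event loop
        yield text[i:i + size]


async def collect(stream):
    """Consume an async iterator of string chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    print(asyncio.run(collect(fake_stream("Hello, world!"))))
```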
Multi-Modal (Vision)
from flexllm import MllmClient
client = MllmClient(
base_url="https://api.openai.com/v1",
api_key="your-api-key",
model="gpt-4o",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}  # Local path or URL
        ]
    }
]
response = await client.call_llm([messages])
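For readers curious what "automatic base64 encoding" of a local path amounts to, the following sketch turns an image file into a data URL of the kind OpenAI-compatible vision endpoints accept. The helper names `encode_image` and `image_part` are hypothetical, and this is not necessarily how flexllm performs the conversion internally.

```python
import base64
import mimetypes


def encode_image(data, filename):
    """Return a data URL for raw image bytes, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback when the extension is unknown
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def image_part(path):
    """Build an image_url content part from a local file."""
    with open(path, "rb") as f:
        url = encode_image(f.read(), path)
    return {"type": "image_url", "image_url": {"url": url}}
```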
Load Balancing with Failover
from flexllm import LLMClientPool
# Create client pool with multiple endpoints
pool = LLMClientPool(
    endpoints=[
        {"base_url": "http://host1:8000/v1", "api_key": "key1", "model": "qwen"},
        {"base_url": "http://host2:8000/v1", "api_key": "key2", "model": "qwen"},
    ],
    load_balance="round_robin",  # round_robin, weighted, random, fallback
    fallback=True,  # Auto switch on failure
)
# Same API as LLMClient
result = await pool.chat_completions(messages)
# Distribute batch requests across endpoints
results = await pool.chat_completions_batch(messages_list, distribute=True)
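The round-robin strategy with failover can be sketched in a few lines. This is an illustration of the idea rather than `LLMClientPool`'s actual code: the class `RoundRobinPool` and the `send(endpoint, request)` callback are assumptions, and the sketch is synchronous for brevity whereas the library is async.

```python
import itertools


class RoundRobinPool:
    """Rotate the starting endpoint per call; on failure, try the next one."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._starts = itertools.cycle(range(len(self.endpoints)))

    def call(self, send, request):
        start = next(self._starts)
        errors = []
        for offset in range(len(self.endpoints)):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return send(endpoint, request)
            except Exception as exc:  # any failure triggers failover
                errors.append((endpoint, exc))
        raise RuntimeError(f"all endpoints failed: {errors}")
```

Each call starts at the next endpoint in the rotation, so healthy endpoints share the load, while a failing endpoint only costs one extra attempt per request routed to it.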
Gemini Client
from flexllm import GeminiClient
# Gemini Developer API
client = GeminiClient(
model="gemini-2.5-flash",
api_key="your-gemini-api-key"
)
# With thinking mode
response = await client.chat_completions(
messages,
thinking="high", # False, True, "minimal", "low", "medium", "high"
)
# Vertex AI mode
client = GeminiClient(
model="gemini-2.5-flash",
project_id="your-project-id",
location="us-central1",
use_vertex_ai=True,
)
Thinking Mode (DeepSeek, etc.)
from flexllm import OpenAIClient
client = OpenAIClient(
base_url="https://api.deepseek.com/v1",
api_key="your-key",
model="deepseek-reasoner",
)
# Enable thinking
result = await client.chat_completions(
messages,
thinking=True,
return_raw=True,
)
# Parse thinking content
parsed = OpenAIClient.parse_thoughts(result.data)
print("Thinking:", parsed["thought"])
print("Answer:", parsed["answer"])
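Some reasoning models emit their thinking inline, wrapped in `<think>...</think>` tags, instead of in a separate response field. The sketch below separates the two parts for that tag-based format only. The function `split_thoughts` is hypothetical and is not a description of how `OpenAIClient.parse_thoughts` works; it merely returns a dict with the same two keys used above.

```python
import re

# Non-greedy match so only the first <think> block is captured.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thoughts(text):
    """Split model output into its reasoning and its final answer."""
    match = _THINK.search(text)
    if match is None:
        return {"thought": "", "answer": text.strip()}
    thought = match.group(1).strip()
    # Everything outside the tags is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return {"thought": thought, "answer": answer}
```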
CLI Usage
# Quick ask (for scripts/agents)
flexllm ask "What is Python?"
flexllm ask "Explain this" -s "You are a code expert"
echo "long text" | flexllm ask "Summarize"
# Interactive chat
flexllm chat
flexllm chat "Hello"
flexllm chat --model=gpt-4 "Hello"
# List models
flexllm models # Remote models
flexllm list_models # Configured models
# Test connection
flexllm test
# Initialize config
flexllm init
CLI Configuration
Create ~/.flexllm/config.yaml:
default: "gpt-4"
models:
  - id: gpt-4
    name: gpt-4
    provider: openai
    base_url: https://api.openai.com/v1
    api_key: your-api-key
  - id: local
    name: local-ollama
    provider: openai
    base_url: http://localhost:11434/v1
    api_key: EMPTY
Or use environment variables:
export FLEXLLM_BASE_URL="https://api.openai.com/v1"
export FLEXLLM_API_KEY="your-key"
export FLEXLLM_MODEL="gpt-4"
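The same three variables can be read from Python when building a client by hand. The sketch below collects them into a dict of constructor arguments. The helper `resolve_settings` is hypothetical, and letting environment variables override the supplied defaults is an assumption; the source does not state how the CLI orders the config file and the environment.

```python
import os

# Maps LLMClient argument names to the environment variables listed above.
ENV_VARS = {
    "base_url": "FLEXLLM_BASE_URL",
    "api_key": "FLEXLLM_API_KEY",
    "model": "FLEXLLM_MODEL",
}


def resolve_settings(env=None, defaults=None):
    """Return settings, preferring environment values over the given defaults."""
    env = os.environ if env is None else env
    defaults = defaults or {}
    return {name: env.get(var, defaults.get(name)) for name, var in ENV_VARS.items()}
```

The resulting dict could then be passed along as `LLMClient(**resolve_settings())` once all three values are set.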
API Reference
LLMClient
Main client for OpenAI-compatible APIs.
LLMClient(
model: str, # Model name
base_url: str, # API base URL
api_key: str = "EMPTY", # API key
provider: str = "auto", # "auto", "openai", "gemini"
cache: ResponseCacheConfig = None, # Cache config
concurrency_limit: int = 50, # Max concurrent requests
max_qps: float = None, # Max requests per second
retry_times: int = 3, # Retry count on failure
retry_delay: float = 1.0, # Delay between retries
timeout: int = 120, # Request timeout (seconds)
)
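To show how `concurrency_limit`, `max_qps`, `retry_times`, and `retry_delay` might interact, here is a small asyncio sketch. It is an approximation built on stated assumptions, not the library's engine: `retry_times` is read as the number of retries after the first attempt, and the QPS cap is approximated by staggering start times, which does not account for retried calls. The names `with_retry` and `run_limited` are hypothetical.

```python
import asyncio


async def with_retry(make_call, retry_times=3, retry_delay=1.0):
    """Await make_call(); on failure, retry up to retry_times more times."""
    for attempt in range(retry_times + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == retry_times:
                raise  # out of retries: surface the last error
            await asyncio.sleep(retry_delay)


async def run_limited(calls, concurrency_limit=50, max_qps=None):
    """Run zero-argument coroutine functions with a concurrency cap and a start-rate cap."""
    sem = asyncio.Semaphore(concurrency_limit)
    interval = 1.0 / max_qps if max_qps else 0.0

    async def one(i, call):
        await asyncio.sleep(i * interval)  # stagger starts to respect max_qps
        async with sem:  # at most concurrency_limit calls in flight
            return await call()

    return await asyncio.gather(*(one(i, c) for i, c in enumerate(calls)))
```

Results come back in the same order as `calls`, because `asyncio.gather` preserves argument order.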
Methods
| Method | Description |
|---|---|
| chat_completions(messages) | Single async request |
| chat_completions_sync(messages) | Single sync request |
| chat_completions_batch(messages_list) | Batch async requests |
| chat_completions_batch_sync(messages_list) | Batch sync requests |
| chat_completions_stream(messages) | Streaming response |
ResponseCacheConfig
ResponseCacheConfig(
enabled: bool = False, # Enable caching
ttl: int = 86400, # Time-to-live in seconds (default 24h)
cache_dir: str = "~/.cache/flexllm/llm_response",
use_ipc: bool = True, # Use IPC for multi-process sharing
)
# Shortcuts
ResponseCacheConfig.with_ttl(3600) # 1 hour TTL
ResponseCacheConfig.persistent() # Never expire
Token Counting
from flexllm import count_tokens, estimate_cost, estimate_batch_cost
# Count tokens
tokens = count_tokens("Hello world", model="gpt-4")
# Estimate cost
cost = estimate_cost(tokens, model="gpt-4", is_input=True)
# Estimate batch cost
total_cost = estimate_batch_cost(messages_list, model="gpt-4")
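The arithmetic behind a cost estimate is a per-token price multiplied by a token count. The sketch below spells that out with prices supplied by the caller. The function `estimate_cost_usd` is hypothetical, the prices in the usage line are made-up placeholders rather than real rates, and flexllm's own `estimate_cost` may use a different price table or unit.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Estimate a request's cost; prices are in USD per one million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Placeholder prices, for illustration only.
    print(estimate_cost_usd(1_000_000, 500_000, input_price=2.0, output_price=4.0))
```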
Architecture
flexllm/
├── flexllm/
│ ├── llm_client.py # Unified client (recommended)
│ ├── openaiclient.py # OpenAI-compatible API
│ ├── geminiclient.py # Google Gemini
│ ├── mllm_client.py # Multi-modal client
│ ├── client_pool.py # Load balancing pool
│ ├── response_cache.py # Response caching
│ ├── token_counter.py # Token counting & cost
│ ├── async_api/ # Async engine
│ └── processors/ # Image & message processing
License
MIT