High-performance LLM client with batch processing, caching, and checkpoint recovery

flexllm

Why flexllm?

flexllm is designed for production LLM applications where reliability and efficiency matter:

  • One Interface, Multiple Providers: Write code once, switch between OpenAI, Gemini, Claude, or self-hosted models (vLLM, Ollama) without changing your application logic.

  • Production-Ready: Built-in retry, timeout, QPS limiting, and checkpoint recovery - handle API failures gracefully without losing progress on large batch jobs.

  • Simple by Design: KISS principle - minimal configuration, sensible defaults, and a clean API that stays out of your way.

Features

  • Batch Processing: Process thousands of requests concurrently with QPS control
  • Response Caching: Built-in caching with TTL support to avoid duplicate API calls
  • Checkpoint Recovery: Resume interrupted batch jobs automatically
  • Multi-Provider: OpenAI, Gemini, Claude, and any OpenAI-compatible API (vLLM, Ollama, DeepSeek, Qwen...)
  • Function Calling: Unified tool use support across providers
  • Multi-Modal: Image + text processing with automatic base64 encoding
  • Load Balancing: Multi-endpoint client pool with failover
  • Async-First: Built on asyncio, so many requests can be in flight at once
  • CLI Tool: Quick ask, chat, and test commands
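The batch-processing bullet pairs a concurrency cap with a QPS cap, and the two limit different things: one bounds how many requests are in flight, the other bounds how quickly new ones start. The sketch below illustrates that idea with the standard library only. It is not flexllm's engine; `QpsLimiter`, `run_batch`, and `fake_request` are hypothetical names used for illustration.

```python
import asyncio
import time


class QpsLimiter:
    """Spaces out callers so that at most `max_qps` of them start per second."""

    def __init__(self, max_qps: float):
        self._interval = 1.0 / max_qps
        self._next_slot = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # Reserve the next free start slot, then sleep until it arrives.
        async with self._lock:
            slot = max(time.monotonic(), self._next_slot)
            self._next_slot = slot + self._interval
        delay = slot - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)


async def run_batch(items, worker, concurrency_limit=50, max_qps=100.0):
    sem = asyncio.Semaphore(concurrency_limit)  # caps requests in flight
    limiter = QpsLimiter(max_qps)               # caps request start rate

    async def one(item):
        async with sem:
            await limiter.wait()
            return await worker(item)

    # gather() returns results in input order, whatever the finish order.
    return await asyncio.gather(*(one(item) for item in items))


async def fake_request(n):
    """Hypothetical stand-in for a network call."""
    await asyncio.sleep(0.01)
    return n * 2


results = asyncio.run(run_batch(range(5), fake_request, concurrency_limit=2, max_qps=50))
print(results)
```

With a slow backend the semaphore tends to be the binding limit; with a fast one the QPS spacing is.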

Installation

pip install flexllm

# With caching support (quotes keep shells such as zsh from expanding the brackets)
pip install "flexllm[cache]"

# With CLI support
pip install "flexllm[cli]"

# All features
pip install "flexllm[all]"

Quick Start

Single Request

from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

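# Async (run inside an async function, or in a REPL/notebook that supports top-level await)
response = await client.chat_completions([
    {"role": "user", "content": "Hello!"}
])

# Sync (blocking call from ordinary code)
response = client.chat_completions_sync([
    {"role": "user", "content": "Hello!"}
])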
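The `await` form needs a running event loop, so in a plain script it has to sit inside a coroutine started with `asyncio.run`. The sketch below shows that wrapping. To stay runnable without network access it uses `EchoClient`, a hypothetical stand-in that copies only the `chat_completions` method name from the example above.

```python
import asyncio


class EchoClient:
    """Hypothetical stand-in; real code would construct LLMClient(...) instead."""

    async def chat_completions(self, messages):
        await asyncio.sleep(0)  # pretend to wait on the network
        return "echo: " + messages[-1]["content"]


async def main():
    client = EchoClient()
    return await client.chat_completions([
        {"role": "user", "content": "Hello!"}
    ])


# asyncio.run() creates the event loop, runs main() to completion, and closes the loop.
response = asyncio.run(main())
print(response)
```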

Batch Processing with Checkpoint Recovery

from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    concurrency_limit=50,
    max_qps=100,
)

messages_list = [
    [{"role": "user", "content": "What is 1+1?"}],
    [{"role": "user", "content": "What is 2+2?"}],
    # ... thousands more
]

# Batch processing with checkpoint recovery
# If interrupted, re-running will resume from where it stopped
results = await client.chat_completions_batch(
    messages_list,
    output_file="results.jsonl",  # Auto-save progress
    show_progress=True,
)
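To make the resume behaviour concrete, here is a rough sketch of one common way to checkpoint into a JSONL file: append one record per finished item and skip recorded indices on the next run. This is an illustration under stated assumptions, not flexllm's actual file format or logic. The `index` and `output` field names and the helpers are hypothetical. The sketch also assumes that inputs keep the same order between runs and that every line was written whole.

```python
import json
import os
import tempfile


def load_done(path):
    """Return {index: record} for every record already saved in the file."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["index"]] = record
    return done


def run_with_checkpoint(inputs, process, path):
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for i, item in enumerate(inputs):
            if i in done:
                continue  # finished in an earlier run
            record = {"index": i, "output": process(item)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push the record to disk before moving on
            done[i] = record
    return [done[i]["output"] for i in range(len(inputs))]


calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for an API call; records what it was asked."""
    calls.append(prompt)
    return prompt.upper()


path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
first = run_with_checkpoint(["a", "b"], fake_llm, path)        # processes a and b
second = run_with_checkpoint(["a", "b", "c"], fake_llm, path)  # only c is new
print(second, calls)
```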

Response Caching

from flexllm import LLMClient, ResponseCacheConfig

# Enable caching (avoid duplicate API calls)
client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    cache=ResponseCacheConfig(enabled=True, ttl=3600),  # 1 hour TTL
)

# Duplicate requests hit the cache automatically
messages = [{"role": "user", "content": "Hello!"}]
result1 = await client.chat_completions(messages)  # API call
result2 = await client.chat_completions(messages)  # Cache hit (no API call)
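A TTL cache of this kind is usually keyed on a hash of the request, with each entry carrying an expiry time. The sketch below shows that mechanism in memory. It is not flexllm's cache, which the API reference describes as disk-backed under `cache_dir`. `TtlCache` and its key scheme are assumptions made for illustration.

```python
import hashlib
import json
import time


class TtlCache:
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be shown without sleeping
        self._store = {}

    @staticmethod
    def key(model, messages):
        # sort_keys makes the key independent of dict insertion order.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


now = [0.0]  # fake clock, in seconds
cache = TtlCache(ttl=3600, clock=lambda: now[0])
k = cache.key("gpt-4", [{"role": "user", "content": "Hello!"}])
cache.set(k, "Hi there")
hit = cache.get(k)   # still inside the one-hour TTL
now[0] = 3601.0
miss = cache.get(k)  # past the TTL
print(hit, miss)
```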

Streaming Response

# `client` and `messages` are defined as in the examples above
async for chunk in client.chat_completions_stream(messages):
    print(chunk, end="", flush=True)
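If the full text is needed after streaming, the chunks can be collected while they are printed. A minimal sketch, assuming each chunk is a plain string as the `print` call above suggests; `fake_stream` is a hypothetical stand-in for the streaming call.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in: yields the text one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)
        yield word + " "


async def collect(stream):
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show output as it arrives
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("streamed one chunk at a time")))
```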

Multi-Modal (Vision)

from flexllm import MllmClient

client = MllmClient(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    model="gpt-4o",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}  # Local path or URL
        ]
    }
]

response = await client.call_llm([messages])
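The "automatic base64 encoding" of local images generally means turning a file path into a `data:` URL before the request is sent, while leaving remote URLs alone. The sketch below shows that conversion with the standard library. `to_image_url` is a hypothetical helper, not a flexllm function, and the library's own processing may differ (for example by resizing images).

```python
import base64
import mimetypes
import os
import tempfile


def to_image_url(path_or_url):
    """Pass URLs through; encode local files as base64 data URLs."""
    if path_or_url.startswith(("http://", "https://", "data:")):
        return path_or_url
    mime, _ = mimetypes.guess_type(path_or_url)
    with open(path_or_url, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


# Demo with a four-byte file that is only named like a PNG.
path = os.path.join(tempfile.mkdtemp(), "tiny.png")
with open(path, "wb") as f:
    f.write(b"\x89PNG")

local_url = to_image_url(path)
remote_url = to_image_url("https://example.com/cat.jpg")
print(local_url)
```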

Load Balancing with Failover

from flexllm import LLMClientPool

# Create client pool with multiple endpoints
pool = LLMClientPool(
    endpoints=[
        {"base_url": "http://host1:8000/v1", "api_key": "key1", "model": "qwen"},
        {"base_url": "http://host2:8000/v1", "api_key": "key2", "model": "qwen"},
    ],
    load_balance="round_robin",  # round_robin, weighted, random, fallback
    fallback=True,  # Auto switch on failure
)

# Same API as LLMClient
result = await pool.chat_completions(messages)

# Distribute batch requests across endpoints
results = await pool.chat_completions_batch(messages_list, distribute=True)
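As a rough picture of what `round_robin` with `fallback=True` could mean, the sketch below rotates the starting endpoint on every call and, on failure, tries the remaining endpoints in order. It is an assumption-laden illustration, not `LLMClientPool`'s implementation. `RoundRobinPool` and `send` are hypothetical, and only `ConnectionError` is treated as a reason to fail over.

```python
import itertools


class RoundRobinPool:
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._cursor = itertools.cycle(range(len(self.endpoints)))

    def call(self, send, payload):
        start = next(self._cursor)  # rotate the first choice on every call
        last_error = None
        for offset in range(len(self.endpoints)):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return send(endpoint, payload)
            except ConnectionError as exc:
                last_error = exc  # try the next endpoint
        raise RuntimeError("all endpoints failed") from last_error


def send(endpoint, payload):
    """Hypothetical transport: host1 is down, host2 answers."""
    if endpoint == "http://host1:8000/v1":
        raise ConnectionError("host1 is down")
    return f"{endpoint} answered {payload}"


pool = RoundRobinPool(["http://host1:8000/v1", "http://host2:8000/v1"])
first = pool.call(send, "ping")   # starts at host1, fails over to host2
second = pool.call(send, "ping")  # starts at host2 directly
print(first)
```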

Gemini Client

from flexllm import GeminiClient

# Gemini Developer API
client = GeminiClient(
    model="gemini-3-flash-preview",
    api_key="your-gemini-api-key"
)

# With thinking mode
response = await client.chat_completions(
    messages,
    thinking="high",  # False, True, "minimal", "low", "medium", "high"
)

# Vertex AI mode
client = GeminiClient(
    model="gemini-3-flash-preview",
    project_id="your-project-id",
    location="us-central1",
    use_vertex_ai=True,
)

Claude Client

from flexllm import LLMClient, ClaudeClient

# Using unified LLMClient (recommended)
client = LLMClient(
    provider="claude",
    api_key="your-anthropic-key",
    model="claude-3-5-sonnet-20241022",
)

response = await client.chat_completions([
    {"role": "user", "content": "Hello, Claude!"}
])

# Or use ClaudeClient directly
client = ClaudeClient(
    api_key="your-anthropic-key",
    model="claude-3-5-sonnet-20241022",
)

# With extended thinking
response = await client.chat_completions(
    messages,
    thinking=True,  # or budget_tokens as int
    return_raw=True,
)
parsed = ClaudeClient.parse_thoughts(response.data)

Function Calling (Tool Use)

from flexllm import LLMClient

client = LLMClient(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    model="gpt-4",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

# Returns ChatCompletionResult with tool_calls
result = await client.chat_completions(
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    return_usage=True,
)

if result.tool_calls:
    for tool_call in result.tool_calls:
        print(f"Function: {tool_call.function['name']}")
        print(f"Arguments: {tool_call.function['arguments']}")
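Once a tool call comes back, the application still has to run the named function itself. The sketch below shows one way to dispatch on the `name` and `arguments` fields used in the loop above. It assumes `arguments` may arrive as a JSON string, which is how OpenAI-style APIs send it. `TOOLS`, `run_tool_call`, and the weather stub are hypothetical.

```python
import json


def get_weather(location):
    """Hypothetical local implementation of the tool declared above."""
    return f"Sunny in {location}"


# Map tool names from the schema to the Python callables that implement them.
TOOLS = {"get_weather": get_weather}


def run_tool_call(function):
    """`function` is a mapping with 'name' and 'arguments' keys."""
    args = function["arguments"]
    if isinstance(args, str):
        args = json.loads(args)  # decode the JSON-encoded argument object
    return TOOLS[function["name"]](**args)


result_text = run_tool_call(
    {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}
)
print(result_text)
```

In an OpenAI-style conversation the returned text would then be sent back to the model in a follow-up message with role `tool`.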

Thinking Mode (DeepSeek, etc.)

from flexllm import OpenAIClient

client = OpenAIClient(
    base_url="https://api.deepseek.com/v1",
    api_key="your-key",
    model="deepseek-reasoner",
)

# Enable thinking
result = await client.chat_completions(
    messages,
    thinking=True,
    return_raw=True,
)

# Parse thinking content
parsed = OpenAIClient.parse_thoughts(result.data)
print("Thinking:", parsed["thought"])
print("Answer:", parsed["answer"])
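For intuition about what a thought/answer split involves, here is a sketch covering two common response shapes: a separate `reasoning_content` field (as the DeepSeek reasoner API returns) and reasoning inlined in `<think>` tags (as some self-hosted models emit). It mirrors the `thought` and `answer` keys shown above but is not the code behind `parse_thoughts`. `split_thoughts` is a hypothetical helper.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thoughts(message):
    """Split an assistant message dict into reasoning and final answer."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    if reasoning:
        return {"thought": reasoning.strip(), "answer": content.strip()}
    match = THINK_RE.search(content)
    if match is None:
        return {"thought": "", "answer": content.strip()}
    # Remove the tagged span and keep whatever surrounds it as the answer.
    answer = content[:match.start()] + content[match.end():]
    return {"thought": match.group(1).strip(), "answer": answer.strip()}


tagged = split_thoughts({"content": "<think>1+1 is 2</think>The answer is 2."})
fielded = split_thoughts({"content": "The answer is 2.", "reasoning_content": "1+1 is 2"})
print(tagged)
```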

CLI Usage

# Quick ask (for scripts/agents)
flexllm ask "What is Python?"
flexllm ask "Explain this" -s "You are a code expert"
echo "long text" | flexllm ask "Summarize"

# Interactive chat
flexllm chat
flexllm chat "Hello"
flexllm chat --model=gpt-4 "Hello"

# List models
flexllm models           # Remote models
flexllm list_models      # Configured models

# Test connection
flexllm test

# Initialize config
flexllm init

CLI Configuration

Create ~/.flexllm/config.yaml:

default: "gpt-4"

models:
  - id: gpt-4
    name: gpt-4
    provider: openai
    base_url: https://api.openai.com/v1
    api_key: your-api-key

  - id: local
    name: local-ollama
    provider: openai
    base_url: http://localhost:11434/v1
    api_key: EMPTY

Or use environment variables:

export FLEXLLM_BASE_URL="https://api.openai.com/v1"
export FLEXLLM_API_KEY="your-key"
export FLEXLLM_MODEL="gpt-4"
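A script that wants to reuse the same variables can read them directly. A minimal sketch: `settings_from_env` is a hypothetical helper, and how the CLI itself orders environment variables against `config.yaml` is not stated here, so no precedence is implied. The `"EMPTY"` fallback mirrors the `api_key` default in the API reference.

```python
import os


def settings_from_env(env=os.environ):
    """Collect the FLEXLLM_* variables into keyword arguments for a client."""
    settings = {
        "base_url": env.get("FLEXLLM_BASE_URL"),
        "api_key": env.get("FLEXLLM_API_KEY", "EMPTY"),
        "model": env.get("FLEXLLM_MODEL"),
    }
    # Drop unset values so the caller can supply its own.
    return {name: value for name, value in settings.items() if value is not None}


demo = settings_from_env({
    "FLEXLLM_BASE_URL": "https://api.openai.com/v1",
    "FLEXLLM_MODEL": "gpt-4",
})
print(demo)
```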

API Reference

LLMClient

Main unified client for all providers.

LLMClient(
    model: str,                    # Model name
    base_url: str,                 # API base URL
    api_key: str = "EMPTY",        # API key
    provider: str = "auto",        # "auto", "openai", "gemini", "claude"
    cache: ResponseCacheConfig = None,  # Cache config
    concurrency_limit: int = 50,   # Max concurrent requests
    max_qps: float = None,         # Max requests per second
    retry_times: int = 3,          # Retry count on failure
    retry_delay: float = 1.0,      # Delay between retries
    timeout: int = 120,            # Request timeout (seconds)
)
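The `retry_times`, `retry_delay`, and `timeout` parameters describe a familiar pattern: bound each attempt with a timeout and pause between failed attempts. The sketch below shows that pattern with `asyncio.wait_for`. It is not flexllm's retry logic, and it assumes `retry_times` counts retries after the first attempt, which the signature alone does not confirm. `call_with_retry` and `flaky` are hypothetical.

```python
import asyncio


async def call_with_retry(make_call, retry_times=3, retry_delay=1.0, timeout=120):
    """Try once, then up to `retry_times` more; each attempt gets `timeout` seconds."""
    last_error = None
    for attempt in range(retry_times + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < retry_times:
                await asyncio.sleep(retry_delay)  # fixed pause before retrying
    raise last_error


attempts = []


async def flaky():
    """Hypothetical call that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(call_with_retry(flaky, retry_times=3, retry_delay=0.01, timeout=1))
print(result, len(attempts))
```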

Methods

Method                                       Description
chat_completions(messages)                   Single async request
chat_completions_sync(messages)              Single sync request
chat_completions_batch(messages_list)        Batch async requests
chat_completions_batch_sync(messages_list)   Batch sync requests
chat_completions_stream(messages)            Streaming response

ResponseCacheConfig

ResponseCacheConfig(
    enabled: bool = False,         # Enable caching
    ttl: int = 86400,              # Time-to-live in seconds (default 24h)
    cache_dir: str = "~/.cache/flexllm/llm_response",
    use_ipc: bool = True,          # Use IPC for multi-process sharing
)

# Shortcuts
ResponseCacheConfig.with_ttl(3600)     # 1 hour TTL
ResponseCacheConfig.persistent()        # Never expire

Token Counting

from flexllm import count_tokens, estimate_cost, estimate_batch_cost

# Count tokens
tokens = count_tokens("Hello world", model="gpt-4")

# Estimate cost
cost = estimate_cost(tokens, model="gpt-4", is_input=True)

# Estimate batch cost
total_cost = estimate_batch_cost(messages_list, model="gpt-4")
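The arithmetic behind a cost estimate is a token count multiplied by a per-token price, with input and output usually priced separately. The sketch below works one example. The prices are hypothetical placeholders rather than real rates, and `cost_usd` is not a flexllm function.

```python
def cost_usd(tokens, price_per_million_tokens):
    """Cost of `tokens` tokens at a price quoted per one million tokens."""
    return tokens / 1_000_000 * price_per_million_tokens


# Hypothetical prices in USD per million tokens.
INPUT_PRICE = 30.0
OUTPUT_PRICE = 60.0

input_cost = cost_usd(1200, INPUT_PRICE)   # prompt tokens
output_cost = cost_usd(300, OUTPUT_PRICE)  # completion tokens
total = input_cost + output_cost
print(f"{total:.3f}")
```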

Architecture

flexllm/
├── flexllm/
│   ├── llm_client.py          # Unified client (recommended)
│   ├── openaiclient.py        # OpenAI-compatible API
│   ├── geminiclient.py        # Google Gemini
│   ├── claudeclient.py        # Anthropic Claude
│   ├── mllm_client.py         # Multi-modal client
│   ├── client_pool.py         # Load balancing pool
│   ├── response_cache.py      # Response caching
│   ├── token_counter.py       # Token counting & cost
│   ├── async_api/             # Async engine
│   └── processors/            # Image & message processing

License

MIT
