High-performance LLM client with batch processing, caching, and checkpoint recovery

flexllm


Features

  • Batch Processing: Process thousands of requests concurrently with QPS control
  • Response Caching: Built-in caching with TTL support to avoid duplicate API calls
  • Checkpoint Recovery: Resume interrupted batch jobs automatically
  • Multi-Provider: OpenAI, Gemini, and any OpenAI-compatible API (vLLM, Ollama, DeepSeek, Qwen...)
  • Multi-Modal: Image + text processing with automatic base64 encoding
  • Load Balancing: Multi-endpoint client pool with failover
  • Async-First: Built on asyncio so many requests can be in flight at once
  • CLI Tool: Quick ask, chat, and test commands

Installation

pip install flexllm

# With Gemini support
pip install flexllm[gemini]

# With caching support
pip install flexllm[cache]

# With CLI support
pip install flexllm[cli]

# All features
pip install flexllm[all]

Quick Start

Single Request

from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

# Async (run inside a coroutine, a notebook, or an async REPL)
response = await client.chat_completions([
    {"role": "user", "content": "Hello!"}
])

# Sync
response = client.chat_completions_sync([
    {"role": "user", "content": "Hello!"}
])
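
The async form above relies on a top-level `await`, which works in a notebook or async REPL but not in a plain script. A minimal sketch of the usual pattern follows. `StubClient` is a hypothetical stand-in for `LLMClient`, used only so the example runs without network access or an API key.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in for LLMClient; it echoes the last message."""

    async def chat_completions(self, messages):
        await asyncio.sleep(0)  # pretend to wait on the network
        return f"echo: {messages[-1]['content']}"


async def main():
    client = StubClient()
    # `await` is only valid inside a coroutine such as this one.
    return await client.chat_completions([{"role": "user", "content": "Hello!"}])


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```

With the real client, the body of `main()` would construct `LLMClient(...)` as shown above instead of `StubClient()`.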

Batch Processing with Checkpoint Recovery

from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    concurrency_limit=50,
    max_qps=100,
)

messages_list = [
    [{"role": "user", "content": "What is 1+1?"}],
    [{"role": "user", "content": "What is 2+2?"}],
    # ... thousands more
]

# Batch processing with checkpoint recovery
# If interrupted, re-running will resume from where it stopped
results = await client.chat_completions_batch(
    messages_list,
    output_file="results.jsonl",  # Auto-save progress
    show_progress=True,
)
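
The README does not describe how progress is stored in `results.jsonl`. One common way to get this resume behaviour is to append one JSON line per finished item and skip already-recorded items on the next run. The sketch below illustrates that general technique only; the record layout (`index`, `input`, `output`) and the helper names are assumptions, not flexllm's actual file format or API.

```python
import json
import os


def load_done(path):
    """Return {index: record} for every record already written to the JSONL file."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["index"]] = record
    return done


def run_batch(items, path, process):
    """Process items in order, appending one JSON line per finished item.

    Items whose index is already in the file are skipped, so re-running
    after an interruption only does the remaining work.
    """
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for index, item in enumerate(items):
            if index in done:
                continue
            record = {"index": index, "input": item, "output": process(item)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push the line to the OS before starting the next item
            done[index] = record
    return [done[i]["output"] for i in range(len(items))]
```

Keying records by position means the input list must be the same, in the same order, on every run. Keying by a hash of the request would relax that.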

Response Caching

from flexllm import LLMClient, ResponseCacheConfig

# Enable caching (avoid duplicate API calls)
client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    cache=ResponseCacheConfig(enabled=True, ttl=3600),  # 1 hour TTL
)

# Duplicate requests hit the cache automatically
messages = [{"role": "user", "content": "Hello!"}]
result1 = await client.chat_completions(messages)  # API call
result2 = await client.chat_completions(messages)  # Cache hit (no API call)
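
To make the TTL behaviour concrete, here is a small in-memory sketch of a response cache keyed by a hash of the request. It is not flexllm's implementation, and `TTLCache` and `key_for` are hypothetical names. It only shows how identical requests map to one key and how an entry stops being served once its TTL has passed.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory cache whose entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    @staticmethod
    def key_for(model, messages):
        # sort_keys makes the serialization independent of dict key order,
        # so identical requests always produce the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)
```

A client wrapping this would call `get` before sending a request and `set` after receiving the response.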

Streaming Response

# `client` and `messages` are defined as in the examples above
async for chunk in client.chat_completions_stream(messages):
    print(chunk, end="", flush=True)
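
Printing chunks as they arrive discards them, so a caller that also needs the full text has to accumulate it. The sketch below assumes, as the snippet above suggests, that each chunk is a plain string. `fake_stream` is a hypothetical stand-in for `client.chat_completions_stream(messages)` so the example runs offline.

```python
import asyncio


async def fake_stream(text, size=4):
    """Hypothetical stand-in for a streaming call: yields the text in small pieces."""
    for i in range(0, len(text), size):
        await asyncio.sleep(0)  # give control back to the event loop between chunks
        yield text[i:i + size]


async def collect(stream):
    """Print each chunk as it arrives and return the concatenated response."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


if __name__ == "__main__":
    asyncio.run(collect(fake_stream("Hello, world!")))
```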

Multi-Modal (Vision)

from flexllm import MllmClient

client = MllmClient(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    model="gpt-4o",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}  # Local path or URL
        ]
    }
]

response = await client.call_llm([messages])
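
The feature list mentions automatic base64 encoding of images. For readers curious what that involves, the sketch below shows the general technique of turning a local file into a `data:` URL that can be placed in an `image_url` field. `to_data_url` is a hypothetical helper and an assumption about the approach, not flexllm's internal code.

```python
import base64
import mimetypes


def to_data_url(path):
    """Read a local file and return it as a base64-encoded `data:` URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"  # fallback when the extension is unknown
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The MIME type is guessed from the file extension only. The file contents are not inspected.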

Load Balancing with Failover

from flexllm import LLMClientPool

# Create client pool with multiple endpoints
pool = LLMClientPool(
    endpoints=[
        {"base_url": "http://host1:8000/v1", "api_key": "key1", "model": "qwen"},
        {"base_url": "http://host2:8000/v1", "api_key": "key2", "model": "qwen"},
    ],
    load_balance="round_robin",  # round_robin, weighted, random, fallback
    fallback=True,  # Auto switch on failure
)

# Same API as LLMClient
result = await pool.chat_completions(messages)

# Distribute batch requests across endpoints
results = await pool.chat_completions_batch(messages_list, distribute=True)
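
The combination of `round_robin` and `fallback=True` can be pictured as "start at the next endpoint in rotation, and on failure keep moving to the following one". The sketch below implements that idea with the standard library only. `RoundRobinPool` and the `send` callback are hypothetical, and flexllm's actual selection and retry logic may differ.

```python
import asyncio
import itertools


class RoundRobinPool:
    """Rotate through endpoints; on failure, try the next until all were tried."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        if not self.endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(range(len(self.endpoints)))

    async def call(self, send, payload):
        start = next(self._cycle)  # each call starts one position further along
        last_error = None
        for offset in range(len(self.endpoints)):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return await send(endpoint, payload)
            except Exception as exc:  # any failure moves on to the next endpoint
                last_error = exc
        raise RuntimeError("all endpoints failed") from last_error
```

Here `send(endpoint, payload)` is whatever coroutine actually performs the request, for example a call on a per-endpoint client.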

Gemini Client

from flexllm import GeminiClient

# Gemini Developer API
client = GeminiClient(
    model="gemini-2.5-flash",
    api_key="your-gemini-api-key"
)

# With thinking mode
response = await client.chat_completions(
    messages,
    thinking="high",  # False, True, "minimal", "low", "medium", "high"
)

# Vertex AI mode
client = GeminiClient(
    model="gemini-2.5-flash",
    project_id="your-project-id",
    location="us-central1",
    use_vertex_ai=True,
)

Thinking Mode (DeepSeek, etc.)

from flexllm import OpenAIClient

client = OpenAIClient(
    base_url="https://api.deepseek.com/v1",
    api_key="your-key",
    model="deepseek-reasoner",
)

# Enable thinking
result = await client.chat_completions(
    messages,
    thinking=True,
    return_raw=True,
)

# Parse thinking content
parsed = OpenAIClient.parse_thoughts(result.data)
print("Thinking:", parsed["thought"])
print("Answer:", parsed["answer"])
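
The README does not show what `parse_thoughts` does internally or what shape `result.data` has. Some reasoning models served behind OpenAI-compatible endpoints wrap their reasoning in `<think>...</think>` tags inside the message text. Assuming that format, the hypothetical helper `split_thoughts` below shows how such text can be separated into the same two fields. It is an illustration of the idea, not the library's parser.

```python
import re

# Non-greedy match across newlines for the first <think>...</think> span.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thoughts(text):
    """Split '<think>reasoning</think>answer' into thought and answer parts."""
    match = THINK_RE.search(text)
    if match is None:
        # No tags: treat the whole text as the answer.
        return {"thought": "", "answer": text.strip()}
    thought = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return {"thought": thought, "answer": answer}
```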

CLI Usage

# Quick ask (for scripts/agents)
flexllm ask "What is Python?"
flexllm ask "Explain this" -s "You are a code expert"
echo "long text" | flexllm ask "Summarize"

# Interactive chat
flexllm chat
flexllm chat "Hello"
flexllm chat --model=gpt-4 "Hello"

# List models
flexllm models           # Remote models
flexllm list_models      # Configured models

# Test connection
flexllm test

# Initialize config
flexllm init

CLI Configuration

Create ~/.flexllm/config.yaml:

default: "gpt-4"

models:
  - id: gpt-4
    name: gpt-4
    provider: openai
    base_url: https://api.openai.com/v1
    api_key: your-api-key

  - id: local
    name: local-ollama
    provider: openai
    base_url: http://localhost:11434/v1
    api_key: EMPTY

Or use environment variables:

export FLEXLLM_BASE_URL="https://api.openai.com/v1"
export FLEXLLM_API_KEY="your-key"
export FLEXLLM_MODEL="gpt-4"
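
The README presents these variables as CLI configuration. It does not say whether the Python classes read them as well. If you want library code to reuse the same settings, a small helper can collect them explicitly. `settings_from_env` is a hypothetical name. The keys it returns match the `model`, `base_url`, and `api_key` parameters of `LLMClient` listed in the API reference.

```python
import os


def settings_from_env(environ=None):
    """Collect FLEXLLM_* variables into a dict, leaving out any that are unset."""
    if environ is None:
        environ = os.environ
    names = {
        "base_url": "FLEXLLM_BASE_URL",
        "api_key": "FLEXLLM_API_KEY",
        "model": "FLEXLLM_MODEL",
    }
    return {key: environ[var] for key, var in names.items() if var in environ}
```

With all three variables set, `LLMClient(**settings_from_env())` would then be equivalent to passing the values by hand.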

API Reference

LLMClient

Main client for OpenAI-compatible APIs.

LLMClient(
    model: str,                    # Model name
    base_url: str,                 # API base URL
    api_key: str = "EMPTY",        # API key
    provider: str = "auto",        # "auto", "openai", "gemini"
    cache: ResponseCacheConfig = None,  # Cache config
    concurrency_limit: int = 50,   # Max concurrent requests
    max_qps: float = None,         # Max requests per second
    retry_times: int = 3,          # Retry count on failure
    retry_delay: float = 1.0,      # Delay between retries
    timeout: int = 120,            # Request timeout (seconds)
)
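
`concurrency_limit` and `max_qps` bound different things: the first caps how many requests are in flight at once, the second caps how many may start per second. How flexllm enforces them is not shown in this README. The sketch below is one conventional asyncio approach, a semaphore plus evenly spaced start slots. `RateLimiter` and `run_all` are hypothetical names.

```python
import asyncio
import time


class RateLimiter:
    """Space out acquisitions so that at most `max_qps` start per second."""

    def __init__(self, max_qps):
        self.interval = 1.0 / max_qps
        self._next_slot = 0.0
        self._lock = asyncio.Lock()

    async def acquire(self):
        # Reserve the next free start slot under the lock, then sleep outside it.
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            self._next_slot = max(now, self._next_slot) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)


async def run_all(jobs, worker, concurrency_limit=50, max_qps=None):
    """Run worker(job) for every job under a concurrency cap and optional QPS cap."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    limiter = RateLimiter(max_qps) if max_qps else None

    async def guarded(job):
        async with semaphore:  # at most `concurrency_limit` bodies run at once
            if limiter is not None:
                await limiter.acquire()
            return await worker(job)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

With slow responses the semaphore is usually the binding limit. With fast responses the rate limiter is.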

Methods

Method                                        Description
chat_completions(messages)                    Single async request
chat_completions_sync(messages)               Single sync request
chat_completions_batch(messages_list)         Batch async requests
chat_completions_batch_sync(messages_list)    Batch sync requests
chat_completions_stream(messages)             Streaming response

ResponseCacheConfig

ResponseCacheConfig(
    enabled: bool = False,         # Enable caching
    ttl: int = 86400,              # Time-to-live in seconds (default 24h)
    cache_dir: str = "~/.cache/flexllm/llm_response",
    use_ipc: bool = True,          # Use IPC for multi-process sharing
)

# Shortcuts
ResponseCacheConfig.with_ttl(3600)     # 1 hour TTL
ResponseCacheConfig.persistent()        # Never expire

Token Counting

from flexllm import count_tokens, estimate_cost, estimate_batch_cost

# Count tokens
tokens = count_tokens("Hello world", model="gpt-4")

# Estimate cost
cost = estimate_cost(tokens, model="gpt-4", is_input=True)

# Estimate batch cost
total_cost = estimate_batch_cost(messages_list, model="gpt-4")
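
Cost estimation of this kind generally reduces to tokens multiplied by a per-token price, with separate prices for input and output. The sketch below shows only that arithmetic. The price table and the `cost_usd` helper are hypothetical, and the numbers are placeholders rather than real provider prices or values used by flexllm.

```python
# Hypothetical prices in USD per one million tokens (placeholders, not real rates).
PRICES_PER_MILLION = {
    "example-model": {"input": 2.50, "output": 10.00},
}


def cost_usd(tokens, model, is_input=True):
    """Return tokens * (price per million tokens) / 1,000,000."""
    kind = "input" if is_input else "output"
    return tokens * PRICES_PER_MILLION[model][kind] / 1_000_000


if __name__ == "__main__":
    # 1,200 input tokens at the placeholder input price.
    print(f"{cost_usd(1200, 'example-model'):.6f}")
```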

Architecture

flexllm/
├── flexllm/
│   ├── llm_client.py          # Unified client (recommended)
│   ├── openaiclient.py        # OpenAI-compatible API
│   ├── geminiclient.py        # Google Gemini
│   ├── mllm_client.py         # Multi-modal client
│   ├── client_pool.py         # Load balancing pool
│   ├── response_cache.py      # Response caching
│   ├── token_counter.py       # Token counting & cost
│   ├── async_api/             # Async engine
│   └── processors/            # Image & message processing

License

MIT
