High-performance LLM client with batch processing, caching, and checkpoint recovery

Maintainers

K.y

flexllm

Why flexllm?

flexllm is designed for production LLM applications where reliability and efficiency matter:

One Interface, Multiple Providers: Write code once, switch between OpenAI, Gemini, Claude, or self-hosted models (vLLM, Ollama) without changing your application logic.
Production-Ready: Built-in retry, timeout, QPS limiting, and checkpoint recovery handle API failures gracefully, so large batch jobs do not lose progress.
Simple by Design: Follows the KISS principle, with minimal configuration, sensible defaults, and a clean API that stays out of your way.

Features

Batch Processing: Process thousands of requests concurrently with QPS control
Response Caching: Built-in caching with TTL support to avoid duplicate API calls
Checkpoint Recovery: Resume interrupted batch jobs automatically
Multi-Provider: OpenAI, Gemini, Claude, and any OpenAI-compatible API (vLLM, Ollama, DeepSeek, Qwen...)
Function Calling: Unified tool use support across providers
Multi-Modal: Image + text processing with automatic base64 encoding
Load Balancing: Multi-endpoint client pool with failover
Async-First: Built on asyncio, so many requests can be in flight at once
CLI Tool: Quick ask, chat, and test commands

Installation

pip install flexllm

# With caching support
pip install flexllm[cache]

# With CLI support
pip install flexllm[cli]

# All features
pip install flexllm[all]

Quick Start

Single Request

from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

# Async (await needs a running event loop, e.g. inside an async function)
response = await client.chat_completions([
    {"role": "user", "content": "Hello!"}
])

# Sync
response = client.chat_completions_sync([
    {"role": "user", "content": "Hello!"}
])

Batch Processing with Checkpoint Recovery

from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    concurrency_limit=50,
    max_qps=100,
)

messages_list = [
    [{"role": "user", "content": "What is 1+1?"}],
    [{"role": "user", "content": "What is 2+2?"}],
    # ... thousands more
]

# Batch processing with checkpoint recovery
# If interrupted, re-running will resume from where it stopped
results = await client.chat_completions_batch(
    messages_list,
    output_file="results.jsonl",  # Auto-save progress
    show_progress=True,
)
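
The resume behaviour described in the comments above can be pictured with a short, standard-library-only sketch. This is not flexllm's implementation: the record layout (`index`, `output`), the helper names, and the `fail_at` switch used to simulate an interruption are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_done(path):
    """Return {index: output} for every record already saved in the JSONL file."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    record = json.loads(line)
                    done[record["index"]] = record["output"]
    return done


def run_batch(inputs, handler, path, fail_at=None):
    """Process inputs in order, appending each result to `path` as it completes.

    `fail_at` simulates an interruption at that index (hypothetical, demo only).
    """
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for i, item in enumerate(inputs):
            if i in done:
                continue  # already finished in a previous run
            if fail_at is not None and i == fail_at:
                raise RuntimeError("simulated interruption")
            output = handler(item)
            f.write(json.dumps({"index": i, "output": output}) + "\n")
            f.flush()  # make the record durable before moving on
            done[i] = output
    return [done[i] for i in range(len(inputs))]


calls = []


def fake_llm(prompt):
    """Stand-in for an API call; records which prompts were actually sent."""
    calls.append(prompt)
    return prompt.upper()


path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
prompts = ["a", "b", "c", "d"]

try:
    run_batch(prompts, fake_llm, path, fail_at=2)  # first run stops early
except RuntimeError:
    pass

results = run_batch(prompts, fake_llm, path)  # second run resumes
```

In this sketch the second run only sends the prompts that have no saved record, which is the property that makes re-running an interrupted job cheap.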

Response Caching

from flexllm import LLMClient, ResponseCacheConfig

# Enable caching (avoid duplicate API calls)
client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    cache=ResponseCacheConfig(enabled=True, ttl=3600),  # 1 hour TTL
)

messages = [{"role": "user", "content": "Hello!"}]

# Duplicate requests hit cache automatically
result1 = await client.chat_completions(messages)  # API call
result2 = await client.chat_completions(messages)  # Cache hit (instant)
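
To make the TTL idea concrete, here is a minimal in-memory sketch of a response cache keyed by a hash of the request. It is an illustration only, not flexllm's cache: the `TTLCache` class, its key scheme, and the injectable clock are assumptions.

```python
import hashlib
import json
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire `ttl` seconds after being set."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    @staticmethod
    def key(model, messages):
        # Identical model + messages always serialise to the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


now = [0.0]  # fake clock, advanced by hand below
cache = TTLCache(ttl=3600, clock=lambda: now[0])
messages = [{"role": "user", "content": "Hello!"}]
k = TTLCache.key("gpt-4", messages)

miss = cache.get(k)        # nothing stored yet
cache.set(k, "Hi there")
hit = cache.get(k)         # served from the cache
now[0] = 3600.0            # one hour later
expired = cache.get(k)     # entry has expired
```

Hashing the serialised request means two calls hit the same entry only when model and messages match exactly.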

Streaming Response

# Reuses client and messages from the examples above
async for chunk in client.chat_completions_stream(messages):
    print(chunk, end="", flush=True)
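
If the full text is needed as well as the live output, the chunks can be collected as they arrive. The sketch below assumes each chunk is a plain string, as the `print` call above suggests; `fake_stream` is a hypothetical stand-in for the streaming call so the example runs without an API key.

```python
import asyncio


async def fake_stream(text, size=4):
    """Stand-in for a streaming call: yields the text in small pieces."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + size]


async def collect(stream):
    """Consume an async stream of string chunks and join them."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts), len(parts)


full_text, n_chunks = asyncio.run(collect(fake_stream("Hello, world!")))
```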

Multi-Modal (Vision)

from flexllm import MllmClient

client = MllmClient(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    model="gpt-4o",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}  # Local path or URL
        ]
    }
]

response = await client.call_llm([messages])
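
The "automatic base64 encoding" of local images can be sketched as a preprocessing step that rewrites local paths into data URLs before the request is sent. This is an assumption about the general technique, not a description of flexllm's processors; `to_data_url` and `inline_local_images` are hypothetical helper names.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(path):
    """Read a local file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_local_images(messages):
    """Return a copy of messages with local image paths replaced by data URLs."""
    out = []
    for message in messages:
        content = message["content"]
        if isinstance(content, list):
            new_parts = []
            for part in content:
                if part.get("type") == "image_url":
                    url = part["image_url"]["url"]
                    # Remote URLs and existing data URLs pass through untouched.
                    if not url.startswith(("http://", "https://", "data:")):
                        part = {"type": "image_url",
                                "image_url": {"url": to_data_url(url)}}
                new_parts.append(part)
            content = new_parts
        out.append({**message, "content": content})
    return out


# Demo with a tiny throwaway file (just the PNG signature bytes).
raw = b"\x89PNG\r\n\x1a\n"
img_path = os.path.join(tempfile.mkdtemp(), "image.png")
with open(img_path, "wb") as f:
    f.write(raw)

messages = [{"role": "user", "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": img_path}},
]}]
converted = inline_local_images(messages)
data_url = converted[0]["content"][1]["image_url"]["url"]
```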

Load Balancing with Failover

from flexllm import LLMClientPool

# Create client pool with multiple endpoints
pool = LLMClientPool(
    endpoints=[
        {"base_url": "http://host1:8000/v1", "api_key": "key1", "model": "qwen"},
        {"base_url": "http://host2:8000/v1", "api_key": "key2", "model": "qwen"},
    ],
    load_balance="round_robin",  # round_robin, weighted, random, fallback
    fallback=True,  # Auto switch on failure
)

# Same API as LLMClient
result = await pool.chat_completions(messages)

# Distribute batch requests across endpoints
results = await pool.chat_completions_batch(messages_list, distribute=True)
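
For readers unfamiliar with the pattern, round-robin selection with failover can be sketched in a few lines. The `RoundRobinPool` class and `fake_send` function below are hypothetical and only illustrate the idea; they say nothing about how `LLMClientPool` is implemented.

```python
import asyncio
import itertools


class RoundRobinPool:
    """Rotate through endpoints; on failure, optionally try the remaining ones."""

    def __init__(self, endpoints, fallback=True):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self.endpoints = list(endpoints)
        self.fallback = fallback
        self._cursor = itertools.cycle(range(len(self.endpoints)))

    async def call(self, send, payload):
        start = next(self._cursor)
        attempts = len(self.endpoints) if self.fallback else 1
        last_error = None
        for offset in range(attempts):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return await send(endpoint, payload)
            except ConnectionError as exc:
                last_error = exc  # remember the failure, move to the next endpoint
        raise last_error


async def fake_send(endpoint, payload):
    """Stand-in transport: host1 is always down, host2 always answers."""
    if endpoint == "http://host1:8000/v1":
        raise ConnectionError("host1 is down")
    return f"{endpoint} answered {payload}"


async def demo():
    pool = RoundRobinPool(["http://host1:8000/v1", "http://host2:8000/v1"])
    return [await pool.call(fake_send, n) for n in range(3)]


answers = asyncio.run(demo())
```

With `fallback=False` the same pool would surface the `ConnectionError` from host1 instead of retrying on host2.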

Gemini Client

from flexllm import GeminiClient

# Gemini Developer API
client = GeminiClient(
    model="gemini-3-flash-preview",
    api_key="your-gemini-api-key"
)

# With thinking mode
response = await client.chat_completions(
    messages,
    thinking="high",  # False, True, "minimal", "low", "medium", "high"
)

# Vertex AI mode
client = GeminiClient(
    model="gemini-3-flash-preview",
    project_id="your-project-id",
    location="us-central1",
    use_vertex_ai=True,
)

Claude Client

from flexllm import LLMClient, ClaudeClient

# Using unified LLMClient (recommended)
client = LLMClient(
    provider="claude",
    api_key="your-anthropic-key",
    model="claude-3-5-sonnet-20241022",
)

response = await client.chat_completions([
    {"role": "user", "content": "Hello, Claude!"}
])

# Or use ClaudeClient directly
client = ClaudeClient(
    api_key="your-anthropic-key",
    model="claude-3-5-sonnet-20241022",
)

# With extended thinking
response = await client.chat_completions(
    messages,
    thinking=True,  # or budget_tokens as int
    return_raw=True,
)
parsed = ClaudeClient.parse_thoughts(response.data)

Function Calling (Tool Use)

from flexllm import LLMClient

client = LLMClient(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    model="gpt-4",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

# Returns ChatCompletionResult with tool_calls
result = await client.chat_completions(
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    return_usage=True,
)

if result.tool_calls:
    for tool_call in result.tool_calls:
        print(f"Function: {tool_call.function['name']}")
        print(f"Arguments: {tool_call.function['arguments']}")
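
Once a tool call comes back, the application still has to run the function itself. A minimal dispatch sketch follows, assuming (as in OpenAI-style APIs) that `arguments` arrives as a JSON-encoded string; the `REGISTRY` mapping, the local `get_weather` body, and `run_tool_call` are hypothetical.

```python
import json


def get_weather(location):
    """Hypothetical local implementation of the tool declared above."""
    return {"location": location, "forecast": "sunny"}


# Maps tool names exposed to the model onto local callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(function):
    """Execute one tool call given its {'name': ..., 'arguments': ...} mapping."""
    name = function["name"]
    arguments = function["arguments"]
    if isinstance(arguments, str):  # OpenAI-style APIs send a JSON string
        arguments = json.loads(arguments)
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    result = REGISTRY[name](**arguments)
    # A real follow-up message usually also carries the tool call's id.
    return {"role": "tool", "name": name, "content": json.dumps(result)}


tool_message = run_tool_call(
    {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}
)
```

Looking names up in an explicit registry keeps the model from invoking anything that was not deliberately exposed.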

Thinking Mode (DeepSeek, etc.)

from flexllm import OpenAIClient

client = OpenAIClient(
    base_url="https://api.deepseek.com/v1",
    api_key="your-key",
    model="deepseek-reasoner",
)

# Enable thinking
result = await client.chat_completions(
    messages,
    thinking=True,
    return_raw=True,
)

# Parse thinking content
parsed = OpenAIClient.parse_thoughts(result.data)
print("Thinking:", parsed["thought"])
print("Answer:", parsed["answer"])
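
How a raw response might be split into `thought` and `answer` can be sketched as follows. The sketch assumes an OpenAI-style response dict in which the reasoning arrives either in a separate `reasoning_content` field (the shape DeepSeek documents for its reasoner model) or inline between `<think>` tags; `split_thoughts` is a hypothetical name and not flexllm's parser.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thoughts(raw):
    """Separate reasoning text from the final answer in a raw chat response."""
    message = raw["choices"][0]["message"]
    content = message.get("content") or ""
    thought = message.get("reasoning_content")
    if thought is None:
        # Fall back to inline <think>...</think> markup, if present.
        match = THINK_RE.search(content)
        if match:
            thought = match.group(1).strip()
            content = THINK_RE.sub("", content, count=1)
    return {"thought": thought or "", "answer": content.strip()}


with_field = split_thoughts(
    {"choices": [{"message": {"content": "2", "reasoning_content": "1+1 is 2"}}]}
)
with_tags = split_thoughts(
    {"choices": [{"message": {"content": "<think>add them</think>\n2"}}]}
)
```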

CLI Usage

# Quick ask (for scripts/agents)
flexllm ask "What is Python?"
flexllm ask "Explain this" -s "You are a code expert"
echo "long text" | flexllm ask "Summarize"

# Interactive chat
flexllm chat
flexllm chat "Hello"
flexllm chat --model=gpt-4 "Hello"

# List models
flexllm models           # Remote models
flexllm list_models      # Configured models

# Test connection
flexllm test

# Initialize config
flexllm init

CLI Configuration

Create ~/.flexllm/config.yaml:

default: "gpt-4"

models:
  - id: gpt-4
    name: gpt-4
    provider: openai
    base_url: https://api.openai.com/v1
    api_key: your-api-key

  - id: local
    name: local-ollama
    provider: openai
    base_url: http://localhost:11434/v1
    api_key: EMPTY

Or use environment variables:

export FLEXLLM_BASE_URL="https://api.openai.com/v1"
export FLEXLLM_API_KEY="your-key"
export FLEXLLM_MODEL="gpt-4"
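
The two configuration sources can be combined in several ways; the README does not state which one wins. The sketch below assumes environment variables override the file, and that a model entry's `name` is the model name sent to the API. Both are assumptions, as is the `resolve` helper. The `CONFIG` dict stands in for the parsed YAML file.

```python
import os

# Stand-in for the parsed contents of ~/.flexllm/config.yaml.
CONFIG = {
    "default": "gpt-4",
    "models": [
        {"id": "gpt-4", "name": "gpt-4", "provider": "openai",
         "base_url": "https://api.openai.com/v1", "api_key": "your-api-key"},
        {"id": "local", "name": "local-ollama", "provider": "openai",
         "base_url": "http://localhost:11434/v1", "api_key": "EMPTY"},
    ],
}

ENV_OVERRIDES = {
    "model": "FLEXLLM_MODEL",
    "base_url": "FLEXLLM_BASE_URL",
    "api_key": "FLEXLLM_API_KEY",
}


def resolve(config, model_id=None, env=os.environ):
    """Pick a model entry, then let environment variables override its fields."""
    wanted = model_id or config["default"]
    entry = next((m for m in config["models"] if m["id"] == wanted), None)
    if entry is None:
        raise KeyError(f"model {wanted!r} is not configured")
    settings = {
        "model": entry["name"],
        "base_url": entry["base_url"],
        "api_key": entry["api_key"],
    }
    for field, variable in ENV_OVERRIDES.items():
        if env.get(variable):  # assumed precedence: environment beats file
            settings[field] = env[variable]
    return settings


default_settings = resolve(CONFIG, env={})
local_settings = resolve(CONFIG, "local", env={"FLEXLLM_API_KEY": "secret"})
```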

API Reference

LLMClient

Main unified client for all providers.

LLMClient(
    model: str,                    # Model name
    base_url: str,                 # API base URL
    api_key: str = "EMPTY",        # API key
    provider: str = "auto",        # "auto", "openai", "gemini", "claude"
    cache: ResponseCacheConfig = None,  # Cache config
    concurrency_limit: int = 50,   # Max concurrent requests
    max_qps: float = None,         # Max requests per second
    retry_times: int = 3,          # Retry count on failure
    retry_delay: float = 1.0,      # Delay between retries
    timeout: int = 120,            # Request timeout (seconds)
)
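
The `retry_times`, `retry_delay`, and `timeout` parameters describe a familiar pattern, sketched below with plain asyncio. The sketch assumes `retry_times` counts retries after the first attempt and that the delay between retries is fixed; flexllm may interpret either differently. `call_with_retry` and `flaky` are hypothetical names.

```python
import asyncio


async def call_with_retry(make_call, retry_times=3, retry_delay=1.0, timeout=120):
    """Await make_call(), retrying on timeouts and connection errors.

    Makes one initial attempt plus up to `retry_times` retries.
    """
    last_error = None
    for attempt in range(retry_times + 1):
        try:
            # wait_for cancels the call and raises if it exceeds the timeout.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < retry_times:
                await asyncio.sleep(retry_delay)
    raise last_error


attempts = []


async def flaky():
    """Stand-in request that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(
    call_with_retry(flaky, retry_times=3, retry_delay=0.01, timeout=1)
)
```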

Methods

Method	Description
`chat_completions(messages)`	Single async request
`chat_completions_sync(messages)`	Single sync request
`chat_completions_batch(messages_list)`	Batch async requests
`chat_completions_batch_sync(messages_list)`	Batch sync requests
`chat_completions_stream(messages)`	Streaming response
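
The relationship between the batch methods, `concurrency_limit`, and `max_qps` can be illustrated with a small asyncio sketch: a semaphore caps in-flight calls, a pacer spaces out start times, and the sync variant simply drives the async one. The names `Pacer`, `run_batch`, and `run_batch_sync` are hypothetical, and this is not flexllm's engine.

```python
import asyncio
import time


class Pacer:
    """Hands out start slots spaced 1 / max_qps seconds apart."""

    def __init__(self, max_qps):
        self.interval = 1.0 / max_qps
        self._next_slot = 0.0

    async def wait(self):
        now = time.monotonic()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self.interval
        await asyncio.sleep(slot - now)


async def run_batch(items, worker, concurrency_limit=50, max_qps=None):
    """Run worker(item) for every item and return results in input order."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    pacer = Pacer(max_qps) if max_qps else None

    async def one(item):
        async with semaphore:  # caps the number of in-flight calls
            if pacer is not None:
                await pacer.wait()  # spaces out call start times
            return await worker(item)

    return await asyncio.gather(*(one(item) for item in items))


def run_batch_sync(items, worker, **options):
    """Blocking wrapper; must not be called from inside a running event loop."""
    return asyncio.run(run_batch(items, worker, **options))


stats = {"active": 0, "peak": 0}


async def fake_worker(item):
    """Stand-in request that tracks how many calls overlap."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.005)
    stats["active"] -= 1
    return item * 2


started = time.monotonic()
results = run_batch_sync(range(5), fake_worker, concurrency_limit=2, max_qps=100)
elapsed = time.monotonic() - started
```

The two limits are independent: the semaphore bounds how many calls overlap, while the pacer bounds how quickly new ones begin.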

ResponseCacheConfig

ResponseCacheConfig(
    enabled: bool = False,         # Enable caching
    ttl: int = 86400,              # Time-to-live in seconds (default 24h)
    cache_dir: str = "~/.cache/flexllm/llm_response",
    use_ipc: bool = True,          # Use IPC for multi-process sharing
)

# Shortcuts
ResponseCacheConfig.with_ttl(3600)     # 1 hour TTL
ResponseCacheConfig.persistent()        # Never expire

Token Counting

from flexllm import count_tokens, estimate_cost, estimate_batch_cost

# Count tokens
tokens = count_tokens("Hello world", model="gpt-4")

# Estimate cost
cost = estimate_cost(tokens, model="gpt-4", is_input=True)

# Estimate batch cost
total_cost = estimate_batch_cost(messages_list, model="gpt-4")
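
The arithmetic behind a cost estimate is a per-token price multiplied by a token count. The sketch below uses a made-up model name and made-up prices purely to show the calculation; it does not reflect the price table or signatures that flexllm ships.

```python
# Hypothetical prices in USD per one million tokens (not real price data).
PRICES_PER_MILLION = {
    "example-model": {"input": 2.50, "output": 10.00},
}


def cost_usd(tokens, model, is_input=True):
    """Cost of `tokens` tokens at the model's input or output price."""
    price = PRICES_PER_MILLION[model]["input" if is_input else "output"]
    return tokens / 1_000_000 * price


def batch_cost_usd(token_counts, model, is_input=True):
    """Total cost for a list of per-request token counts."""
    return sum(cost_usd(count, model, is_input) for count in token_counts)


single = cost_usd(1_000_000, "example-model")          # one million input tokens
total = batch_cost_usd([1000, 3000], "example-model")  # 4,000 input tokens
```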

Architecture

flexllm/
├── flexllm/
│   ├── llm_client.py          # Unified client (recommended)
│   ├── openaiclient.py        # OpenAI-compatible API
│   ├── geminiclient.py        # Google Gemini
│   ├── claudeclient.py        # Anthropic Claude
│   ├── mllm_client.py         # Multi-modal client
│   ├── client_pool.py         # Load balancing pool
│   ├── response_cache.py      # Response caching
│   ├── token_counter.py       # Token counting & cost
│   ├── async_api/             # Async engine
│   └── processors/            # Image & message processing

License

MIT

Release history

Version	Release date
0.11.0	May 19, 2026
0.10.4	Apr 13, 2026
0.10.3	Apr 8, 2026
0.10.2	Apr 6, 2026
0.10.1	Apr 4, 2026
0.10.0	Mar 22, 2026
0.9.1	Mar 17, 2026
0.9.0	Mar 15, 2026
0.8.5	Mar 10, 2026
0.8.4	Mar 10, 2026
0.8.3	Mar 10, 2026
0.8.2	Mar 8, 2026
0.8.1	Mar 7, 2026
0.8.0	Mar 7, 2026
0.7.3	Mar 3, 2026
0.7.2	Mar 3, 2026
0.7.0	Feb 25, 2026
0.6.2	Feb 24, 2026
0.6.1	Feb 7, 2026
0.6.0	Mar 17, 2026
0.5.8	Feb 3, 2026
0.5.7	Feb 2, 2026
0.5.6	Feb 1, 2026
0.5.5	Feb 1, 2026
0.5.4	Jan 31, 2026
0.5.3	Jan 31, 2026
0.5.2	Jan 31, 2026
0.5.1	Jan 31, 2026
0.5.0	Jan 30, 2026
0.4.5	Jan 23, 2026
0.4.4	Jan 22, 2026
0.4.3	Jan 21, 2026
0.4.2	Jan 21, 2026
0.4.1	Jan 20, 2026
0.4.0	Jan 19, 2026
0.3.4	Jan 18, 2026
0.3.3	Jan 18, 2026
0.3.1	Jan 17, 2026
0.3.0	Jan 11, 2026 (this version)
0.2.2	Jan 7, 2026
0.2.1	Jan 7, 2026
0.2.0	Jan 6, 2026
0.1.0	Jan 3, 2026

Download files

Source Distribution

flexllm-0.3.0.tar.gz (121.0 kB)

Uploaded Jan 11, 2026 Source

Built Distribution

flexllm-0.3.0-py3-none-any.whl (123.1 kB)

Uploaded Jan 11, 2026 Python 3

File details

Details for the file flexllm-0.3.0.tar.gz.

File metadata

Download URL: flexllm-0.3.0.tar.gz
Upload date: Jan 11, 2026
Size: 121.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for flexllm-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`ebdae66782bbe2574b44d250d88bff4be5a53bdd403c1c9d22ed364d995fe28d`
MD5	`df7fa41944a256d32d3baa8376642c43`
BLAKE2b-256	`41cc2c627739d3acf7e3fb46b14b0a87a77b442a8338e64392b12cd14b584afe`

Provenance

The following attestation bundles were made for flexllm-0.3.0.tar.gz:

Publisher: python-publish.yml on KenyonY/flexllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flexllm-0.3.0.tar.gz
- Subject digest: ebdae66782bbe2574b44d250d88bff4be5a53bdd403c1c9d22ed364d995fe28d
- Sigstore transparency entry: 813640958
- Sigstore integration time: Jan 11, 2026
Source repository:
- Permalink: KenyonY/flexllm@6890b7a4d668f87dc4dc8c4396ef9cafb82d21e7
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/KenyonY
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@6890b7a4d668f87dc4dc8c4396ef9cafb82d21e7
- Trigger Event: push

File details

Details for the file flexllm-0.3.0-py3-none-any.whl.

File metadata

Download URL: flexllm-0.3.0-py3-none-any.whl
Upload date: Jan 11, 2026
Size: 123.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for flexllm-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4f2f459482e296181f86306457dfd8229acc67ccc0f388ce9e30df385bca5060`
MD5	`8f9c1babbe3de837a6b3668c383b7f59`
BLAKE2b-256	`fd40b7293a045cb06365a089114d9205cae37eb54d8174c441e960912f99ff54`

Provenance

The following attestation bundles were made for flexllm-0.3.0-py3-none-any.whl:

Publisher: python-publish.yml on KenyonY/flexllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flexllm-0.3.0-py3-none-any.whl
- Subject digest: 4f2f459482e296181f86306457dfd8229acc67ccc0f388ce9e30df385bca5060
- Sigstore transparency entry: 813640960
- Sigstore integration time: Jan 11, 2026
Source repository:
- Permalink: KenyonY/flexllm@6890b7a4d668f87dc4dc8c4396ef9cafb82d21e7
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/KenyonY
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@6890b7a4d668f87dc4dc8c4396ef9cafb82d21e7
- Trigger Event: push
