High-performance LLM client with batch processing, caching, and checkpoint recovery

Project description

flexllm

One Client, All LLMs
Production-grade LLM client with checkpoint recovery, response caching, and multi-provider support

Design Philosophy

One unified entry point for all LLM providers.

from flexllm import LLMClient

# That's all you need to import. Everything else is configuration.

flexllm follows the "Single Interface, Multiple Backends" principle. Whether you're calling OpenAI, Gemini, Claude, or a self-hosted model, the API stays the same. Provider differences are abstracted away, so you can focus on your application logic rather than on SDK quirks.

# OpenAI GPT-4
client = LLMClient(base_url="https://api.openai.com/v1", model="gpt-4", api_key="...")

# Google Gemini
client = LLMClient(provider="gemini", model="gemini-2.0-flash", api_key="...")

# Anthropic Claude
client = LLMClient(provider="claude", model="claude-sonnet-4-20250514", api_key="...")

# Self-hosted (vLLM, Ollama, etc.)
client = LLMClient(base_url="http://localhost:8000/v1", model="qwen2.5")

# The API is identical for all:
result = await client.chat_completions(messages)
results = await client.chat_completions_batch(messages_list)

Features

| Feature | Description |
| --- | --- |
| Unified Interface | One `LLMClient` for OpenAI, Gemini, Claude, and any OpenAI-compatible API |
| Checkpoint Recovery | Batch jobs auto-resume from interruption - process millions of requests safely |
| Response Caching | Built-in caching with TTL and IPC multi-process sharing |
| Cost Tracking | Real-time cost monitoring with budget control |
| High-Performance Async | Fine-grained concurrency control, QPS limiting, and streaming |
| Load Balancing | Multi-endpoint distribution with automatic failover |

Installation

pip install flexllm

# With all features
pip install flexllm[all]

Quick Start

Basic Usage

from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

# Async (call from inside a coroutine, e.g. one started with asyncio.run)
response = await client.chat_completions([
    {"role": "user", "content": "Hello!"}
])

# Sync
response = client.chat_completions_sync([
    {"role": "user", "content": "Hello!"}
])

Batch Processing with Checkpoint Recovery

Process millions of requests safely. If a run is interrupted, restart it and it continues from where it left off.

messages_list = [
    [{"role": "user", "content": f"Question {i}"}]
    for i in range(100000)
]

# Interrupted at 50,000? Re-run and it continues from 50,001.
results = await client.chat_completions_batch(
    messages_list,
    output_jsonl="results.jsonl",  # Progress saved here
    show_progress=True,
)
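To make the resume behaviour concrete, here is a minimal, self-contained sketch of JSONL-based checkpointing. It is not flexllm's implementation: `load_done`, `run_batch`, the `worker` callback, and the `{"index", "output"}` record layout are all hypothetical names chosen for illustration, and the sketch assumes every line in the file was written completely.

```python
import asyncio
import json
import os


def load_done(path):
    """Return {index: output} for every record already saved in the JSONL file."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["index"]] = record["output"]
    return done


async def run_batch(items, worker, path):
    """Process items in order, appending each result to `path` as soon as it is ready."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for index, item in enumerate(items):
            if index in done:
                continue  # already finished in an earlier run
            output = await worker(item)
            f.write(json.dumps({"index": index, "output": output}) + "\n")
            f.flush()  # push the record to the OS before starting the next item
            done[index] = output
    return [done[i] for i in range(len(items))]
```

Because each result is appended as soon as it is available, a second call with the same `path` only invokes `worker` for indices that are missing from the file.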

Response Caching

from flexllm import LLMClient, ResponseCacheConfig

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    cache=ResponseCacheConfig(enabled=True, ttl=3600),  # 1 hour TTL
)

# First call: API request (~2s, ~$0.01)
result1 = await client.chat_completions(messages)

# Second call: Cache hit (~0.001s, $0)
result2 = await client.chat_completions(messages)
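The cache-hit behaviour above can be illustrated with a small in-memory TTL cache. This is only a sketch under stated assumptions, not how flexllm stores responses: `cache_key` and `TTLCache` are hypothetical names, the key is assumed to be a hash of the model name plus the messages, and the IPC sharing mentioned in the feature list is not covered.

```python
import hashlib
import json
import time


def cache_key(model, messages):
    """Stable key: the same model and messages always hash to the same string."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl      # lifetime of an entry, in seconds
        self.clock = clock  # injectable so expiry can be checked without sleeping
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)
```

A client built this way would call `get` before sending a request and `set` after receiving the response, so a repeated request within the TTL never reaches the API.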

Cost Tracking

# Track costs during batch processing
results, cost_report = await client.chat_completions_batch(
    messages_list,
    return_cost_report=True,
)
print(f"Total cost: ${cost_report.total_cost:.4f}")

# Real-time cost display in progress bar
results = await client.chat_completions_batch(
    messages_list,
    track_cost=True,  # Shows 💰 $0.0012 in progress bar
)
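As a rough illustration of what cost tracking with a budget involves, the sketch below accumulates cost from token counts. The `CostMeter` and `BudgetExceeded` names are hypothetical, the per-million-token prices are placeholders rather than real provider pricing, and flexllm's own cost report may be computed differently.

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when accumulated spend passes the configured budget."""


@dataclass
class CostMeter:
    input_price_per_m: float         # USD per 1M input tokens (placeholder value)
    output_price_per_m: float        # USD per 1M output tokens (placeholder value)
    budget: Optional[float] = None   # stop once total_cost exceeds this, if set
    total_cost: float = 0.0

    def add(self, input_tokens, output_tokens):
        """Record one request and return its cost in USD."""
        cost = (
            input_tokens * self.input_price_per_m
            + output_tokens * self.output_price_per_m
        ) / 1_000_000
        self.total_cost += cost
        if self.budget is not None and self.total_cost > self.budget:
            raise BudgetExceeded(f"spent ${self.total_cost:.4f} of ${self.budget:.4f}")
        return cost
```

With placeholder prices of $2 and $8 per million tokens, a request using 1,000 input and 500 output tokens costs (1000 × 2 + 500 × 8) / 1,000,000 = $0.006.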

Streaming

# Token-by-token streaming
async for chunk in client.chat_completions_stream(messages):
    print(chunk, end="", flush=True)

# Batch streaming - process results as they complete
async for result in client.iter_chat_completions_batch(messages_list):
    process(result)
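Batch streaming means results are handed back in completion order rather than input order. One way to get that behaviour with plain asyncio is shown below; `iter_as_completed` and `worker` are hypothetical names, and this is a sketch of the general technique rather than flexllm's engine.

```python
import asyncio


async def iter_as_completed(items, worker, concurrency=10):
    """Yield (index, result) pairs as soon as each item finishes."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run(index, item):
        async with semaphore:  # cap the number of in-flight requests
            return index, await worker(item)

    tasks = [asyncio.ensure_future(run(i, item)) for i, item in enumerate(items)]
    for finished in asyncio.as_completed(tasks):
        yield await finished
```

Yielding the original index alongside each result lets the caller match a result to its input even though the order is no longer preserved.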

Multi-Provider Support

from flexllm import LLMClient

# OpenAI (auto-detected from base_url)
client = LLMClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
    model="gpt-4o",
)

# Gemini
client = LLMClient(
    provider="gemini",
    api_key="your-gemini-key",
    model="gemini-2.0-flash",
)

# Claude
client = LLMClient(
    provider="claude",
    api_key="your-anthropic-key",
    model="claude-sonnet-4-20250514",
)

# Self-hosted (vLLM, Ollama, etc.)
client = LLMClient(
    base_url="http://localhost:8000/v1",
    model="qwen2.5",
)

Thinking Mode (Reasoning Models)

Unified interface for DeepSeek-R1, Qwen3, Claude extended thinking, and Gemini thinking.

result = await client.chat_completions(
    messages,
    thinking=True,      # Enable thinking
    return_raw=True,
)

# Unified parsing across all providers
parsed = client.parse_thoughts(result.data)
print("Thinking:", parsed["thought"])
print("Answer:", parsed["answer"])
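For intuition about what a `thought`/`answer` split involves, the sketch below handles one common case: models such as DeepSeek-R1 and Qwen3 can emit their reasoning inline between `<think>` and `</think>` tags. `split_thinking` is a hypothetical helper, not `parse_thoughts` itself, and it does not cover providers that return reasoning in separate structured fields.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Split raw model text into its reasoning and its final answer."""
    match = THINK_RE.search(text)
    if match is None:
        return {"thought": "", "answer": text.strip()}
    thought = match.group(1).strip()
    # The answer is everything outside the matched block.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return {"thought": thought, "answer": answer}
```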

Load Balancing

from flexllm import LLMClientPool

pool = LLMClientPool(
    endpoints=[
        {"base_url": "http://gpu1:8000/v1", "model": "qwen"},
        {"base_url": "http://gpu2:8000/v1", "model": "qwen"},
    ],
    load_balance="round_robin",  # or "weighted", "random", "fallback"
    fallback=True,               # Auto-switch on failure
)

# Requests automatically distributed
results = await pool.chat_completions_batch(messages_list, distribute=True)
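The combination of `round_robin` and `fallback=True` can be pictured as follows. This is a minimal sketch with hypothetical names (`RoundRobinPool`, `send`), and it assumes a failure surfaces as `ConnectionError`; it does not reproduce how `LLMClientPool` detects failures or weights endpoints.

```python
import asyncio
import itertools


class RoundRobinPool:
    def __init__(self, endpoints, fallback=True):
        self.endpoints = list(endpoints)
        self.fallback = fallback
        self._counter = itertools.count()

    async def call(self, send, request):
        """Send to the next endpoint in rotation; on failure try the others in turn."""
        start = next(self._counter) % len(self.endpoints)
        attempts = len(self.endpoints) if self.fallback else 1
        last_error = None
        for offset in range(attempts):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return await send(endpoint, request)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and move to the next endpoint
        raise last_error
```

With two healthy endpoints the requests alternate between them; if one goes down, every request ends up on the surviving endpoint instead of failing.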

CLI

# Quick ask
flexllm ask "What is Python?"

# Interactive chat
flexllm chat

# Batch processing with cost tracking
flexllm batch input.jsonl -o output.jsonl --track-cost

# Model management
flexllm list        # Configured models
flexllm models      # Remote available models
flexllm test        # Test connection

Architecture

flexllm/
├── clients/           # All client implementations
│   ├── base.py        # Abstract base class (LLMClientBase)
│   ├── llm.py         # Unified entry point (LLMClient)
│   ├── openai.py      # OpenAI-compatible backend
│   ├── gemini.py      # Google Gemini backend
│   ├── claude.py      # Anthropic Claude backend
│   ├── pool.py        # Multi-endpoint load balancer
│   └── router.py      # Provider routing strategies
├── pricing/           # Cost estimation and tracking
│   ├── cost_tracker.py
│   └── token_counter.py
├── cache/             # Response caching with IPC
├── async_api/         # High-performance async engine
└── msg_processors/    # Multi-modal message processing

The architecture follows a simple layered design:

LLMClient (Unified entry point - recommended)
    │
    ├── Provider auto-detection or explicit selection
    │
    └── Backend Clients (internal)
            ├── OpenAIClient
            ├── GeminiClient
            └── ClaudeClient
                    │
                    └── LLMClientBase (Abstract - 4 methods to implement)
                            │
                            ├── ConcurrentRequester (Async engine)
                            ├── ResponseCache (Caching layer)
                            └── CostTracker (Cost monitoring)
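The layering above, a facade that delegates to a provider-specific backend behind a shared abstract interface, can be sketched in a few lines. The class and method names here (`Backend`, `build_request`, `extract_text`, `UnifiedClient`) are hypothetical and are not the four abstract methods of `LLMClientBase`; only the OpenAI-style and Gemini-style payload shapes are taken from those providers' public APIs, and the role mapping is simplified to user and assistant turns.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """What every provider backend must supply in this sketch."""

    @abstractmethod
    def build_request(self, messages): ...

    @abstractmethod
    def extract_text(self, response): ...


class OpenAIStyle(Backend):
    def build_request(self, messages):
        return {"messages": messages}

    def extract_text(self, response):
        return response["choices"][0]["message"]["content"]


class GeminiStyle(Backend):
    def build_request(self, messages):
        # Gemini names the assistant role "model" and wraps text in "parts".
        return {
            "contents": [
                {
                    "role": "model" if m["role"] == "assistant" else "user",
                    "parts": [{"text": m["content"]}],
                }
                for m in messages
            ]
        }

    def extract_text(self, response):
        return response["candidates"][0]["content"]["parts"][0]["text"]


BACKENDS = {"openai": OpenAIStyle, "gemini": GeminiStyle}


class UnifiedClient:
    """Facade: callers see one interface, the backend handles provider details."""

    def __init__(self, provider="openai"):
        self.backend = BACKENDS[provider]()

    def request_body(self, messages):
        return self.backend.build_request(messages)

    def parse(self, response):
        return self.backend.extract_text(response)
```

Adding a provider in this design means writing one more `Backend` subclass and registering it; the calling code does not change.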

API Reference

LLMClient

LLMClient(
    provider: str = "auto",        # "auto", "openai", "gemini", "claude"
    model: str,                    # Model name (required)
    base_url: str = None,          # API base URL (required for openai)
    api_key: str = "EMPTY",        # API key
    cache: ResponseCacheConfig,    # Cache config (optional; see Response Caching)
    concurrency_limit: int = 10,   # Max concurrent requests
    max_qps: float = None,         # Max requests per second
    retry_times: int = 3,          # Retry count on failure
    timeout: int = 120,            # Request timeout (seconds)
)
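To show how `concurrency_limit`, `max_qps`, and `retry_times` can interact, here is a small asyncio sketch. It is an illustration under assumptions, not flexllm's engine: `Throttle` and `run_all` are hypothetical names, the rate limit is enforced by spacing request starts evenly, and `retry_times` is interpreted as the number of retries after the first attempt.

```python
import asyncio
import time


class Throttle:
    """Space request starts at least 1 / max_qps seconds apart."""

    def __init__(self, max_qps):
        self.interval = 1.0 / max_qps
        self._next_start = 0.0

    async def wait(self):
        # No await between reading and updating the slot, so this is race-free.
        now = time.monotonic()
        delay = max(0.0, self._next_start - now)
        self._next_start = max(now, self._next_start) + self.interval
        if delay:
            await asyncio.sleep(delay)


async def run_all(items, worker, concurrency_limit=10, max_qps=None, retry_times=3):
    """Run worker over items with bounded concurrency, optional QPS cap, and retries."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    throttle = Throttle(max_qps) if max_qps else None

    async def run_one(item):
        async with semaphore:  # at most concurrency_limit workers at once
            for attempt in range(retry_times + 1):
                if throttle:
                    await throttle.wait()
                try:
                    return await worker(item)
                except Exception:
                    if attempt == retry_times:
                        raise  # retries exhausted: surface the last error

    return await asyncio.gather(*(run_one(item) for item in items))
```

The semaphore bounds how many requests are in flight, while the throttle bounds how quickly new ones start; the two limits are independent and the stricter one wins at any moment.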

Main Methods

| Method | Description |
| --- | --- |
| `chat_completions(messages)` | Single async request |
| `chat_completions_sync(messages)` | Single sync request |
| `chat_completions_batch(messages_list)` | Batch async with checkpoint |
| `iter_chat_completions_batch(messages_list)` | Streaming batch results |
| `chat_completions_stream(messages)` | Token-by-token streaming |

License

Apache 2.0

Project details

Maintainers

K.y

Release history

0.11.0

May 19, 2026

0.10.4

Apr 13, 2026

0.10.3

Apr 8, 2026

0.10.2

Apr 6, 2026

0.10.1

Apr 4, 2026

0.10.0

Mar 22, 2026

0.9.1

Mar 17, 2026

0.9.0

Mar 15, 2026

0.8.5

Mar 10, 2026

0.8.4

Mar 10, 2026

0.8.3

Mar 10, 2026

0.8.2

Mar 8, 2026

0.8.1

Mar 7, 2026

0.8.0

Mar 7, 2026

0.7.3

Mar 3, 2026

0.7.2

Mar 3, 2026

0.7.0

Feb 25, 2026

0.6.2

Feb 24, 2026

0.6.1

Feb 7, 2026

0.6.0

Mar 17, 2026

0.5.8

Feb 3, 2026

0.5.7

Feb 2, 2026

0.5.6

Feb 1, 2026

0.5.5

Feb 1, 2026

0.5.4

Jan 31, 2026

0.5.3

Jan 31, 2026

0.5.2

Jan 31, 2026

0.5.1

Jan 31, 2026

0.5.0

Jan 30, 2026

0.4.5

Jan 23, 2026

0.4.4

Jan 22, 2026

0.4.3

Jan 21, 2026

0.4.2

Jan 21, 2026

This version

0.4.1

Jan 20, 2026

0.4.0

Jan 19, 2026

0.3.4

Jan 18, 2026

0.3.3

Jan 18, 2026

0.3.1

Jan 17, 2026

0.3.0

Jan 11, 2026

0.2.2

Jan 7, 2026

0.2.1

Jan 7, 2026

0.2.0

Jan 6, 2026

0.1.0

Jan 3, 2026

Download files

Source Distribution

flexllm-0.4.1.tar.gz (135.9 kB)

Uploaded Jan 20, 2026 Source

Built Distribution

flexllm-0.4.1-py3-none-any.whl (136.5 kB)

Uploaded Jan 20, 2026 Python 3

File details

Details for the file flexllm-0.4.1.tar.gz.

File metadata

Download URL: flexllm-0.4.1.tar.gz
Upload date: Jan 20, 2026
Size: 135.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for flexllm-0.4.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `8db6ea0f7b2642e3138c136184e34c666168b41ee685c8aaec48ddb05994aad6` |
| MD5 | `6414b41dfa520944db2049339b08114f` |
| BLAKE2b-256 | `1d6a823dfd4537f0bf8c2ac0dd2fa27fab257db9602d45b32ab9e513cf269b02` |

Provenance

The following attestation bundles were made for flexllm-0.4.1.tar.gz:

Publisher: python-publish.yml on KenyonY/flexllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flexllm-0.4.1.tar.gz
- Subject digest: 8db6ea0f7b2642e3138c136184e34c666168b41ee685c8aaec48ddb05994aad6
- Sigstore transparency entry: 836630464
- Sigstore integration time: Jan 20, 2026
Source repository:
- Permalink: KenyonY/flexllm@c16e25c81ea77a3279cbc9c161bd1b673ed8fc1b
- Branch / Tag: refs/tags/v0.4.1
- Owner: https://github.com/KenyonY
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c16e25c81ea77a3279cbc9c161bd1b673ed8fc1b
- Trigger Event: push

File details

Details for the file flexllm-0.4.1-py3-none-any.whl.

File metadata

Download URL: flexllm-0.4.1-py3-none-any.whl
Upload date: Jan 20, 2026
Size: 136.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for flexllm-0.4.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `a6f3a903e43141ea11b459f6cdcb185d23b4da985cd57575fd7725b901bdff6d` |
| MD5 | `7a961cc9889d78de676cdc98236e75d3` |
| BLAKE2b-256 | `83715a20764c5db625a80f37d78bbc3b98007c6eacbd39cf17008b8ebce4b69a` |

Provenance

The following attestation bundles were made for flexllm-0.4.1-py3-none-any.whl:

Publisher: python-publish.yml on KenyonY/flexllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: flexllm-0.4.1-py3-none-any.whl
- Subject digest: a6f3a903e43141ea11b459f6cdcb185d23b4da985cd57575fd7725b901bdff6d
- Sigstore transparency entry: 836630488
- Sigstore integration time: Jan 20, 2026
Source repository:
- Permalink: KenyonY/flexllm@c16e25c81ea77a3279cbc9c161bd1b673ed8fc1b
- Branch / Tag: refs/tags/v0.4.1
- Owner: https://github.com/KenyonY
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c16e25c81ea77a3279cbc9c161bd1b673ed8fc1b
- Trigger Event: push
