
Python utility for calling LLM APIs.


lm-deluge

lm-deluge is a lightweight helper library for maxing out your rate limits with LLM providers. It provides the following:

  • Unified client – Send prompts to all relevant models with a single client.
  • Massive concurrency with throttling – Set max_tokens_per_minute and max_requests_per_minute and let it fly. The client will process as many requests as possible while respecting rate limits and retrying failures.
  • Spray across models/providers – Configure a client with multiple models from any provider(s), and sampling weights. The client samples a model for each request.
  • Tool Use – Unified API for defining tools for all providers, and creating tools automatically from python functions.
  • MCP Support – Instantiate a Tool from a local or remote MCP server so that any LLM can use it, whether or not that provider natively supports MCP.
  • Caching – Save completions in a local or distributed cache to avoid repeated LLM calls to process the same input.
  • Convenient message constructor – No more looking up how to build an Anthropic messages list with images. Our Conversation and Message classes work great with our client or with the openai and anthropic packages.
  • Sync and async APIs – Use the client from sync or async code.

STREAMING IS NOT IN SCOPE. There are plenty of packages that let you stream chat completions across providers. The sole purpose of this package is to do very fast batch inference using APIs. Sorry!

Installation

pip install lm-deluge

The package relies on environment variables for API keys. Typical variables include OPENAI_API_KEY, ANTHROPIC_API_KEY, COHERE_API_KEY, META_API_KEY, and GOOGLE_API_KEY. LLMClient automatically loads your .env file when imported, so we recommend setting the keys there.
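For example, a minimal .env file in the project root might look like this (placeholder values; include only the keys for the providers you actually use):

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...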

Quickstart

The easiest way to get started is with the .basic constructor. This uses sensible default arguments for rate limits and sampling parameters so that you don't have to provide a ton of arguments.

from lm_deluge import LLMClient

client = LLMClient.basic("gpt-4o-mini")
resps = client.process_prompts_sync(["Hello, world!"])
print(resps[0].completion)

Spraying Across Models

To distribute your requests across models, just provide a list of more than one model to the constructor. The rate limits for the client apply to the client as a whole, not per-model, so you may want to increase them:

from lm_deluge import LLMClient

client = LLMClient.basic(
    ["gpt-4o-mini", "claude-3-haiku"],
    max_requests_per_minute=10_000
)
resps = client.process_prompts_sync(
    ["Hello, ChatGPT!", "Hello, Claude!"]
)
print(resps[0].completion)

Configuration

API calls can be customized in a few ways.

  1. Sampling Parameters. These control things like structured outputs, maximum completion tokens, and nucleus sampling. Provide a custom SamplingParams to the LLMClient to set temperature, top_p, json_mode, max_new_tokens, and/or reasoning_effort. You can pass a single SamplingParams to use for all models, or a list of SamplingParams the same length as the list of models. Many of these arguments can also be passed directly to LLMClient.basic so you don't have to construct an entire SamplingParams object.
  2. Arguments to LLMClient. This is where you set request timeout, rate limits, model name(s), model weight(s) for distributing requests across models, retries, and caching.
  3. Arguments to process_prompts. Per-call, you can set verbosity, whether to display progress, and whether to return just completions (rather than the full APIResponse object).

Putting it all together:

from lm_deluge import LLMClient, SamplingParams

client = LLMClient(
    "gpt-4",
    max_requests_per_minute=100,
    max_tokens_per_minute=100_000,
    max_concurrent_requests=500,
    sampling_params=SamplingParams(temperature=0.5, max_new_tokens=30)
)

await client.process_prompts_async(
    ["What is the capital of Mars?"],
    show_progress=False,
    return_completions_only=True
)

Multi-Turn Conversations

Constructing conversations to pass to models is notoriously annoying. Each provider has a slightly different way of defining a list of messages, and with the introduction of images/multi-part messages it's only gotten worse. We provide convenience constructors so you don't have to remember all that stuff.

from lm_deluge import Message, Conversation

prompt = Conversation.system("You are a helpful assistant.").add(
    Message.user("What's in this image?").add_image("tests/image.jpg")
)

client = LLMClient.basic("gpt-4.1-mini")
resps = client.process_prompts_sync([prompt])

This just works. Images can be local images on disk, URLs, bytes, base64 data URLs... go wild. You can use Conversation.to_openai or Conversation.to_anthropic to format your messages for the OpenAI or Anthropic clients directly.
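For example, here is a rough sketch of reusing the same prompt with the official openai package (whether to_openai is called on the instance like this, and its exact return shape, are assumptions):

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = openai_client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=prompt.to_openai(),  # assumed to return an OpenAI-style messages list
)
print(completion.choices[0].message.content)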

Basic Tool Use

Define tools from Python functions and use them with any model:

from lm_deluge import LLMClient, Tool

def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny and 72°F"

tool = Tool.from_function(get_weather)
client = LLMClient.basic("claude-3-haiku")
resp = client.process_prompts_sync(["What's the weather in Paris?"], tools=[tool])

MCP Integration

Connect to MCP servers to extend your models with external tools:

from lm_deluge import LLMClient, Tool

# Connect to a local MCP server
mcp_tool = Tool.from_mcp(
    "filesystem",
    command="npx -y @modelcontextprotocol/server-filesystem",
    args=["/path/to/directory"],
)
client = LLMClient.basic("gpt-4o-mini", tools=[mcp_tool])
resp = client.process_prompts_sync(["List the files in the current directory"])

Caching

lm_deluge.cache includes LevelDB, SQLite, and custom dictionary-based caches. Pass an instance via LLMClient(..., cache=my_cache) and previously seen prompts will not be re-sent across different process_prompts_[...] calls.

IMPORTANT: Caching does not currently work for prompts in the SAME batch. That is, if you call process_prompts_sync with the same prompt 100 times, there will be 0 cache hits. If you call process_prompts_sync a second time with those same 100 prompts, all 100 will be cache hits. The cache is intended to be persistent and help you save costs across many invocations, but it can't help with a single batch-inference session (yet!).
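A minimal sketch of wiring up a persistent cache (the SqliteCache class name and its constructor arguments are assumptions here; check lm_deluge.cache for the actual exports, and note that the plain LLMClient constructor may also want explicit rate limits):

from lm_deluge import LLMClient
from lm_deluge.cache import SqliteCache  # class name assumed

cache = SqliteCache("completions.db")  # constructor signature assumed
client = LLMClient("gpt-4o-mini", cache=cache)

# The first call pays for the API request; a later call with the same prompt
# is served from the cache instead of being re-sent.
resps = client.process_prompts_sync(["Hello, world!"])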

Asynchronous Client

Use this in asynchronous code, or in a Jupyter notebook. If you try to use the sync client in a Jupyter notebook, you'll have to use nest-asyncio, because internally the sync client uses async code. Don't do it! Just use the async client!

import asyncio

from lm_deluge import LLMClient

client = LLMClient.basic("gpt-4o-mini")

async def main():
    responses = await client.process_prompts_async(
        ["an async call"],
        return_completions_only=True,
    )
    print(responses[0])

asyncio.run(main())

Available Models

We support all models listed in src/lm_deluge/models.py. An older version of this client supported Bedrock and Vertex. We plan to re-implement Bedrock support (our previous support was spotty, and we need to figure out cross-region inference in order to support the newest Claude models). Vertex support is not currently planned, since Google allows you to connect your Vertex account to AI Studio, and Vertex authentication is a huge pain (it requires service account credentials, etc.).

Feature Support

We support structured outputs via the json_mode parameter on SamplingParams; structured outputs constrained to a schema are planned. Reasoning models are supported via the reasoning_effort parameter, which is translated to a thinking budget for Claude/Gemini. Image inputs are supported. Tool use is supported via the unified Tool API described above. We support logprobs for OpenAI models that return them, via the logprobs argument to the LLMClient.
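For example, turning on JSON mode looks roughly like this (a minimal sketch; reasoning_effort would be set on SamplingParams the same way for reasoning models):

from lm_deluge import LLMClient, SamplingParams

client = LLMClient(
    "gpt-4o-mini",
    max_requests_per_minute=1_000,
    max_tokens_per_minute=100_000,
    sampling_params=SamplingParams(json_mode=True),
)
resps = client.process_prompts_sync(
    ["Return a JSON object describing the capital of France."]
)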

Built‑in tools

The lm_deluge.llm_tools package exposes a few helper functions:

  • extract – structure text or images into a Pydantic model based on a schema.
  • translate – translate a list of strings to English.
  • score_llm – simple yes/no style scoring with optional log probability output.

Experimental embeddings (embed.embed_parallel_async) and document reranking (rerank.rerank_parallel_async) clients are also provided.
