
Fleet-level batch dispatcher for LLM APIs. Pool requests across coroutines, route to provider Batch APIs, save 50% on cost without rewriting your agent loops.

llmfleet

Fleet-level batch dispatcher for LLM APIs. Pool requests from many coroutines, route to provider Batch APIs (50% discount on Anthropic, OpenAI), keep your sync agent loops working.

pip install llmfleet

Why

Anthropic's Batch API cuts cost by 50% on both input and output tokens, but it is a poor fit for a single agent: a lone request ends up polling for roughly 90–120s before its result comes back. The right unit of batching isn't one user's turn; it's a fleet of agents' turns pooled together by a layer the user never sees. llmfleet is that layer.
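To put a number on the trade-off, here is a rough sketch of blended spend when only part of a fleet's traffic can tolerate batch latency. The function name, its parameters, and the example figures are illustrative assumptions and not part of llmfleet; only the 50% discount comes from the description above.

```python
def blended_cost(sync_cost: float, batch_fraction: float, batch_discount: float = 0.5) -> float:
    """Cost after routing `batch_fraction` of spend through a discounted batch API.

    sync_cost: what the whole workload would cost if every call were sync.
    batch_fraction: share of that spend (0..1) that can tolerate batch latency.
    batch_discount: discount applied to batched spend (0.5 means 50% off).
    """
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    batched = sync_cost * batch_fraction * (1.0 - batch_discount)
    direct = sync_cost * (1.0 - batch_fraction)
    return batched + direct


# Hypothetical fleet: 100 units of sync spend, 75% of it latency-tolerant.
print(blended_cost(100.0, 0.75))
```

The saving scales linearly with the latency-tolerant share, which is why pooling across a whole fleet matters more than batching any single agent.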

Quick start

import asyncio
from anthropic import AsyncAnthropic
from llmfleet import FleetDispatcher, RoutingPolicy

async def main():
    essays = ["First essay text", "Second essay text"]  # placeholder inputs to grade
    client = AsyncAnthropic()
    policy = RoutingPolicy(
        sync_max_latency_ms=5_000,    # interactive paths stay sync
        batch_window_ms=30_000,       # otherwise pool for 30s
        batch_min_size=10,
        batch_max_size=100,
    )

    async with FleetDispatcher(client, policy=policy) as fleet:
        # Tight latency → sync
        chat = await fleet.submit(
            latency_budget_ms=2_000,
            model="claude-sonnet-4-20250514",
            max_tokens=200,
            messages=[{"role": "user", "content": "Hi"}],
        )

        # Loose latency → pooled into a batch
        graded = await asyncio.gather(*[
            fleet.submit(
                latency_budget_ms=600_000,
                model="claude-sonnet-4-20250514",
                max_tokens=200,
                messages=[{"role": "user", "content": f"Grade: {essay}"}],
            )
            for essay in essays
        ])

asyncio.run(main())

How it works

FleetDispatcher runs a background flusher coroutine. Calls to submit() either:

  • Run synchronously if latency_budget_ms <= policy.sync_max_latency_ms, or
  • Get queued. When the queue holds batch_min_size items or batch_window_ms elapses, whichever comes first, the flusher submits one Anthropic Message Batch, polls until it completes, and hands each result back to the awaiting coroutine via its Future.
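For orientation, a single flush of the kind described above might look roughly like this against the official anthropic Python SDK. This is a sketch, not llmfleet's source: the helper name run_one_batch and the shape of the pending dict are hypothetical, and it assumes an AsyncAnthropic client that exposes messages.batches.create, retrieve, and results.

```python
import asyncio


async def run_one_batch(client, pending: dict, poll_interval_s: float = 2.0) -> dict:
    """Submit `pending` ({custom_id: messages.create kwargs}) as one Message Batch.

    Returns {custom_id: result}, where each result is the SDK's per-request
    result object (its .type is "succeeded", "errored", "canceled" or "expired").
    """
    batch = await client.messages.batches.create(
        requests=[{"custom_id": cid, "params": params} for cid, params in pending.items()]
    )
    # Poll until the provider reports that the whole batch has ended.
    while batch.processing_status != "ended":
        await asyncio.sleep(poll_interval_s)
        batch = await client.messages.batches.retrieve(batch.id)

    results = {}
    # Results can arrive in any order, so key them by custom_id.
    async for entry in await client.messages.batches.results(batch.id):
        results[entry.custom_id] = entry.result
    return results
```

Keying by custom_id is what lets a dispatcher match each result to the Future of the coroutine that submitted it.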

Concurrent submit() calls from independent coroutines automatically share batches.
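The pooling idea can be illustrated with a toy dispatcher built on plain asyncio. Everything here (MiniPool, run_flusher, fake_batch) is a hypothetical stand-in rather than llmfleet's implementation, and the window is simplified to a fixed tick instead of a timer started by the first queued item.

```python
import asyncio


class MiniPool:
    """Toy request pool: many coroutines submit, one flusher sends batches."""

    def __init__(self, run_batch, window_s=0.05, min_size=3, max_size=100):
        self._run_batch = run_batch      # async callable: list[payload] -> list[result]
        self._window_s = window_s
        self._min_size = min_size
        self._max_size = max_size
        self._queue = []                 # (payload, Future) pairs awaiting a flush
        self._wake = asyncio.Event()
        self.batches_submitted = 0

    async def submit(self, payload):
        fut = asyncio.get_running_loop().create_future()
        self._queue.append((payload, fut))
        if len(self._queue) >= self._min_size:
            self._wake.set()             # enough items queued: flush early
        return await fut

    async def run_flusher(self):
        while True:
            try:
                await asyncio.wait_for(self._wake.wait(), timeout=self._window_s)
            except asyncio.TimeoutError:
                pass                     # window elapsed: flush whatever is queued
            self._wake.clear()
            while self._queue:
                chunk = self._queue[:self._max_size]
                self._queue = self._queue[self._max_size:]
                try:
                    results = await self._run_batch([p for p, _ in chunk])
                except Exception as exc:
                    # Fail every waiter in the chunk instead of leaving it hanging.
                    for _, fut in chunk:
                        if not fut.done():
                            fut.set_exception(exc)
                    continue
                self.batches_submitted += 1
                for (_, fut), result in zip(chunk, results):
                    if not fut.done():
                        fut.set_result(result)


async def demo():
    async def fake_batch(payloads):
        # Stand-in for a provider batch call; echoes inputs in upper case.
        return [p.upper() for p in payloads]

    pool = MiniPool(fake_batch)
    flusher = asyncio.create_task(pool.run_flusher())
    # Five independent coroutines submit concurrently and share batches.
    answers = await asyncio.gather(*(pool.submit(f"req-{i}") for i in range(5)))
    flusher.cancel()
    return answers, pool.batches_submitted


answers, batches = asyncio.run(demo())
print(answers, batches)
```

Each caller only ever awaits its own Future, so the calling code looks the same whether its request travelled alone or alongside many others.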

API

FleetDispatcher(client, policy=None, on_batch_submitted=None)

# Lifecycle: use as async context manager
async with FleetDispatcher(client) as fleet:
    response = await fleet.submit(latency_budget_ms=N, **anthropic_messages_create_kwargs)

# Force routing decisions
await fleet.submit_sync(**kwargs)
await fleet.submit_batch(**kwargs)

# Introspection
fleet.stats.sync_calls
fleet.stats.batched_calls
fleet.stats.batches_submitted

Configuration

RoutingPolicy(
    sync_max_latency_ms=5_000,    # threshold for sync routing
    batch_window_ms=30_000,       # how long to wait for a batch to fill
    batch_min_size=1,             # minimum size before flushing early
    batch_max_size=100,           # hard cap per batch (Anthropic allows up to 100,000 requests)
    poll_interval_s=2.0,          # batch status poll interval
)
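As a way to reason about these settings, the sketch below mirrors the fields and defaults listed above in a stand-in dataclass and applies the documented threshold rule. PolicySketch, route, and budget_at_risk are hypothetical names for illustration; the real RoutingPolicy may validate or route differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySketch:
    """Stand-in mirroring the RoutingPolicy fields and defaults listed above."""
    sync_max_latency_ms: int = 5_000
    batch_window_ms: int = 30_000
    batch_min_size: int = 1
    batch_max_size: int = 100
    poll_interval_s: float = 2.0

    def __post_init__(self):
        if not 1 <= self.batch_min_size <= self.batch_max_size:
            raise ValueError("need 1 <= batch_min_size <= batch_max_size")


def route(policy: PolicySketch, latency_budget_ms: int) -> str:
    """Apply the documented rule: budgets at or under the threshold go sync."""
    return "sync" if latency_budget_ms <= policy.sync_max_latency_ms else "batch"


def budget_at_risk(policy: PolicySketch, latency_budget_ms: int) -> bool:
    """True when a request is routed to batch yet its budget is shorter than the
    queueing window, so it could run out while still queued if the batch does
    not fill early."""
    return route(policy, latency_budget_ms) == "batch" and latency_budget_ms < policy.batch_window_ms


policy = PolicySketch()
for budget in (2_000, 6_000, 600_000):
    print(budget, route(policy, budget), budget_at_risk(policy, budget))
```

Budgets that fall between sync_max_latency_ms and batch_window_ms are the ones worth a second look when tuning a policy.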

What it doesn't do

  • Not a router across providers/models for quality. Use a real router.
  • Not cross-process pooling — fleet is process-local. Use a shared queue (Redis / SQS) for cross-process.
  • Doesn't try to batch tool-call turns where the tool is on the critical path; pass force_sync=True for those.

Status

v0.1.0: Anthropic only. OpenAI Batch API and Bedrock async-invoke are on the roadmap. Patches welcome.

License

MIT
