
LLMRateLimiter


Client-side rate limiting for LLM API calls using Redis-backed FIFO queues.

Features

  • FIFO Queue-Based: Fair ordering prevents thundering herd problems
  • Distributed: Redis-backed for multi-process/multi-server deployments
  • Flexible Limits: Supports combined TPM, split input/output TPM, or both
  • Automatic Retry: Exponential backoff with jitter for Redis connection issues
  • Graceful Degradation: Allows requests through on Redis failure
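The graceful-degradation behavior listed above can be sketched as a fail-open wrapper. This is an illustrative sketch, not the library's internal code; `check_capacity` is a hypothetical stand-in for the Redis-backed capacity check:

```python
def acquire_with_fallback(check_capacity):
    """check_capacity() returns seconds to wait, or raises ConnectionError.

    Fail open: if Redis is unreachable, admit the request immediately
    rather than blocking the application on the rate limiter.
    """
    try:
        return check_capacity()
    except ConnectionError:
        # Redis failure: allow the request through with zero wait.
        return 0.0
```

Failing open trades strict limit enforcement for availability: a Redis outage degrades to unthrottled calls instead of a hard stop.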

How It Works

flowchart LR
    subgraph Client["Your Application"]
        App[LLM App]
    end

    subgraph RL["LLMRateLimiter"]
        Limiter[RateLimiter]
    end

    subgraph Redis["Redis"]
        Queue[(FIFO Queue<br/>Sorted Set)]
    end

    subgraph LLM["LLM Provider"]
        API[API]
    end

    App -->|1. acquire| Limiter
    Limiter -->|2. Check limits| Queue
    Queue -->|3. Wait time| Limiter
    Limiter -->|4. Return| App
    App -->|5. Call API| API

The rate limiter uses a Redis sorted set to maintain a FIFO queue of requests. Each request records its token consumption, and a server-side Lua script atomically calculates, from the sliding window, when capacity will next be available.
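The sliding-window calculation can be sketched in plain Python. This is a simplified illustration of the idea, not the library's actual Lua script; the function name and entry format are hypothetical:

```python
import time

def wait_time(entries, tokens_needed, tpm, window=60.0, now=None):
    """Sliding-window capacity check over a FIFO queue.

    entries: list of (timestamp, tokens) pairs sorted by timestamp,
    mirroring a Redis sorted set scored by request time.
    Returns seconds to wait until `tokens_needed` fits under `tpm`.
    """
    now = time.time() if now is None else now
    # Keep only entries still inside the sliding window.
    live = [(ts, tk) for ts, tk in entries if ts > now - window]
    used = sum(tk for _, tk in live)
    if used + tokens_needed <= tpm:
        return 0.0  # capacity available now
    # Otherwise, walk the queue in FIFO order and find when enough
    # old entries will have expired out of the window.
    freed = 0
    for ts, tk in live:
        freed += tk
        if used - freed + tokens_needed <= tpm:
            return (ts + window) - now
    return window  # worst case: wait a full window
```

In the real system this logic runs inside Redis via a Lua script, so the check-and-reserve step is atomic across all processes sharing the queue.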

Installation

pip install llmratelimiter

Or with uv:

uv add llmratelimiter

Quick Start

Combined Mode (OpenAI/Anthropic)

For providers with a single tokens-per-minute limit:

from redis.asyncio import Redis
from llmratelimiter import RateLimiter, RateLimitConfig

redis = Redis(host="localhost", port=6379)
config = RateLimitConfig(tpm=100_000, rpm=100)
limiter = RateLimiter(redis, "gpt-4", config)

# Acquire capacity before making API call
await limiter.acquire(tokens=5000)
response = await openai.chat.completions.create(...)

Split Mode (GCP Vertex AI)

For providers with separate input/output token limits:

config = RateLimitConfig(input_tpm=4_000_000, output_tpm=128_000, rpm=360)
limiter = RateLimiter(redis, "gemini-1.5-pro", config)

# Estimate output tokens upfront
result = await limiter.acquire(input_tokens=5000, output_tokens=2048)
response = await vertex_ai.generate(...)

# Adjust after getting actual output
await limiter.adjust(result.record_id, actual_output=response.output_tokens)

With Connection Manager

For production use with automatic retry:

from llmratelimiter import (
    RateLimiter, RateLimitConfig, RedisConnectionManager, RetryConfig
)

manager = RedisConnectionManager(
    host="localhost",
    port=6379,
    retry_config=RetryConfig(max_retries=3, base_delay=0.1),
)
config = RateLimitConfig(tpm=100_000, rpm=100)
limiter = RateLimiter(manager, "gpt-4", config)

await limiter.acquire(tokens=5000)

Configuration Options

RateLimitConfig

Parameter          Description
tpm                Combined tokens-per-minute limit
input_tpm          Input tokens-per-minute limit
output_tpm         Output tokens-per-minute limit
rpm                Requests-per-minute limit
window_seconds     Sliding window size in seconds (default: 60)
burst_multiplier   Multiplier allowing bursts above the limits (default: 1.0)
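For example, the options above can be combined; this hypothetical configuration allows brief 20% bursts over a shorter 30-second window:

```python
config = RateLimitConfig(
    tpm=100_000,
    rpm=100,
    window_seconds=30,     # size of the sliding window (default 60)
    burst_multiplier=1.2,  # permit bursts 20% above the nominal limits
)
```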

RetryConfig

Parameter          Description
max_retries        Maximum retry attempts (default: 3)
base_delay         Initial delay in seconds (default: 0.1)
max_delay          Maximum delay cap in seconds (default: 5.0)
exponential_base   Backoff multiplier (default: 2.0)
jitter             Random jitter fraction, 0-1 (default: 0.1)
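The schedule implied by these parameters can be sketched as follows. This is an illustrative reading of the defaults (assumed formula: exponential growth capped at `max_delay`, then jittered by a random fraction), not the library's exact implementation:

```python
import random

def retry_delay(attempt, base_delay=0.1, exponential_base=2.0,
                max_delay=5.0, jitter=0.1):
    """Delay before retry number `attempt` (0-indexed), in seconds."""
    # Exponential growth, capped at max_delay.
    delay = min(base_delay * exponential_base ** attempt, max_delay)
    # Apply +/- jitter as a random fraction of the delay to avoid
    # synchronized retries across processes.
    return delay * (1 + random.uniform(-jitter, jitter))
```

With the defaults, retries back off roughly 0.1 s, 0.2 s, 0.4 s, ... and never exceed about 5.5 s including jitter.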

License

MIT License - see LICENSE for details.
