
LLMRateLimiter


Client-side rate limiting for LLM API calls using Redis-backed FIFO queues.

Features

  • FIFO Queue-Based: Fair ordering prevents thundering herd problems
  • Distributed: Redis-backed for multi-process/multi-server deployments
  • Flexible Limits: Supports combined TPM, split input/output TPM, or both
  • Automatic Retry: Exponential backoff with jitter for Redis connection issues
  • Graceful Degradation: Allows requests through on Redis failure

How It Works

flowchart LR
    subgraph Client["Your Application"]
        App[LLM App]
    end

    subgraph RL["LLMRateLimiter"]
        Limiter[RateLimiter]
    end

    subgraph Redis["Redis"]
        Queue[(FIFO Queue<br/>Sorted Set)]
    end

    subgraph LLM["LLM Provider"]
        API[API]
    end

    App -->|1. acquire| Limiter
    Limiter -->|2. Check limits| Queue
    Queue -->|3. Wait time| Limiter
    Limiter -->|4. Return| App
    App -->|5. Call API| API

The rate limiter uses Redis sorted sets to maintain a FIFO queue of requests. Each request records its token consumption, and a Lua script atomically calculates when capacity will become available within the sliding window.
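The same sliding-window idea can be sketched in plain Python with an in-memory stand-in for the Redis sorted set (class and method names here are illustrative, not the library's API):

```python
import time
from collections import deque


class SlidingWindowSketch:
    """Illustrative in-memory analogue of the Redis sorted-set logic."""

    def __init__(self, tpm: int, window_seconds: float = 60.0):
        self.tpm = tpm
        self.window = window_seconds
        self.entries: deque = deque()  # (timestamp, tokens), oldest first

    def wait_time(self, tokens: int, now: float = None) -> float:
        """Seconds until `tokens` fit under the limit; 0.0 means go now."""
        now = time.monotonic() if now is None else now
        # Drop entries that have slid out of the window.
        while self.entries and self.entries[0][0] <= now - self.window:
            self.entries.popleft()
        used = sum(t for _, t in self.entries)
        if used + tokens <= self.tpm:
            return 0.0
        # Otherwise wait until enough old entries expire to free capacity.
        freed = 0
        for ts, t in self.entries:
            freed += t
            if used - freed + tokens <= self.tpm:
                return (ts + self.window) - now
        return self.window  # worst case: wait out a full window

    def record(self, tokens: int, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        self.entries.append((now, tokens))
```

In the real library this check-and-record step runs as one atomic Lua script, so concurrent processes cannot interleave between reading the window and reserving capacity.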

Installation

pip install llmratelimiter

Or with uv:

uv add llmratelimiter

Quick Start

Basic Usage

from llmratelimiter import RateLimiter

# Just pass a Redis URL and your limits
limiter = RateLimiter("redis://localhost:6379", "gpt-4", tpm=100_000, rpm=100)

# Recommended: specify input and output tokens separately
await limiter.acquire(input_tokens=3000, output_tokens=2000)
response = await openai.chat.completions.create(...)

Split Mode (GCP Vertex AI)

For providers with separate input/output token limits:

limiter = RateLimiter(
    "redis://localhost:6379", "gemini-1.5-pro",
    input_tpm=4_000_000, output_tpm=128_000, rpm=360
)

# Estimate output tokens upfront
result = await limiter.acquire(input_tokens=5000, output_tokens=2048)
response = await vertex_ai.generate(...)

# Adjust after getting actual output
await limiter.adjust(result.record_id, actual_output=response.output_tokens)

AWS Bedrock (Burndown Rate)

AWS Bedrock uses a burndown rate where output tokens count 5x toward TPM:

limiter = RateLimiter(
    "redis://localhost:6379", "claude-sonnet",
    tpm=100_000, rpm=100, burndown_rate=5.0
)

await limiter.acquire(input_tokens=3000, output_tokens=1000)
# TPM consumption: 3000 + (5.0 * 1000) = 8000 tokens
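The weighted consumption above is simple arithmetic; a minimal sketch (the helper name is ours, not part of the library):

```python
def burndown_tokens(input_tokens: int, output_tokens: int,
                    burndown_rate: float = 1.0) -> float:
    """TPM consumption when output tokens are weighted by a burndown rate
    (AWS Bedrock weights output tokens 5x)."""
    return input_tokens + burndown_rate * output_tokens
```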

Azure OpenAI (RPS Smoothing)

Azure OpenAI enforces rate limits at sub-second intervals. If you set 600 RPM, Azure actually enforces 10 requests per second. Bursts can trigger 429 errors even when you're under the minute-level limit.

Enable RPS smoothing to prevent burst-triggered rate limits:

# Auto-calculate RPS from RPM (600 RPM = 10 RPS = 100ms minimum gap)
limiter = RateLimiter(
    "redis://localhost:6379", "gpt-4",
    tpm=300_000, rpm=600, smooth_requests=True
)

# Or set explicit RPS for more conservative rate limiting
limiter = RateLimiter(
    "redis://localhost:6379", "gpt-4",
    tpm=300_000, rpm=600, rps=8  # 125ms minimum gap
)

# Custom evaluation interval (Azure may use 1s or 10s intervals)
limiter = RateLimiter(
    "redis://localhost:6379", "gpt-4",
    tpm=300_000, rpm=600, smooth_requests=True, smoothing_interval=10.0
)
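The minimum gaps quoted in the comments above follow from simple division; a sketch of the arithmetic (the helper is illustrative, not the library's API):

```python
def min_request_gap(rpm: int, rps: float = None) -> float:
    """Minimum gap in seconds between requests implied by an RPS limit.

    An explicit `rps` takes precedence; otherwise RPS is derived from RPM.
    """
    effective_rps = rps if rps is not None else rpm / 60.0
    return 1.0 / effective_rps
```

With 600 RPM this yields a 0.1 s (100 ms) gap, and an explicit `rps=8` yields 0.125 s (125 ms), matching the examples above.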

With Existing Redis Client

from redis.asyncio import Redis
from llmratelimiter import RateLimiter

redis = Redis(host="localhost", port=6379)
limiter = RateLimiter(redis=redis, model="gpt-4", tpm=100_000, rpm=100)

await limiter.acquire(input_tokens=3000, output_tokens=2000)

With Connection Manager (Production)

For production use with automatic retry and connection pooling:

from llmratelimiter import RateLimiter, RedisConnectionManager, RetryConfig

manager = RedisConnectionManager(
    "redis://localhost:6379",
    retry_config=RetryConfig(max_retries=3, base_delay=0.1),
)
limiter = RateLimiter(manager, "gpt-4", tpm=100_000, rpm=100)

await limiter.acquire(input_tokens=3000, output_tokens=2000)

SSL Connection

Use rediss:// for SSL/TLS connections:

limiter = RateLimiter("rediss://localhost:6379", "gpt-4", tpm=100_000, rpm=100)

Configuration Options

RateLimitConfig

Parameter           Description
tpm                 Combined tokens-per-minute limit
input_tpm           Input tokens-per-minute limit
output_tpm          Output tokens-per-minute limit
rpm                 Requests-per-minute limit
window_seconds      Sliding window size in seconds (default: 60)
burst_multiplier    Burst allowance above the limits (default: 1.0)
burndown_rate       Output-token multiplier for combined TPM (default: 1.0; AWS Bedrock: 5.0)
smooth_requests     Enable RPS smoothing for burst prevention (default: False)
rps                 Explicit requests-per-second limit (auto-enables smoothing when > 0)
smoothing_interval  Evaluation interval for RPS smoothing, in seconds (default: 1.0)

RetryConfig

Parameter         Description
max_retries       Maximum retry attempts (default: 3)
base_delay        Initial delay in seconds (default: 0.1)
max_delay         Maximum delay cap in seconds (default: 5.0)
exponential_base  Backoff multiplier (default: 2.0)
jitter            Random variation, 0-1 (default: 0.1)
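Assuming the delay formula these parameters suggest (exponential growth capped at `max_delay`, plus up to `jitter` fractional randomness), the retry schedule can be sketched as:

```python
import random


def backoff_delay(attempt: int, base_delay: float = 0.1, max_delay: float = 5.0,
                  exponential_base: float = 2.0, jitter: float = 0.1,
                  rng: random.Random = None) -> float:
    """Delay before retry `attempt` (0-based): exponential growth capped at
    max_delay, with up to `jitter` fractional random variation added."""
    rng = rng or random.Random()
    delay = min(base_delay * exponential_base ** attempt, max_delay)
    return delay * (1.0 + rng.random() * jitter)
```

With the defaults this gives roughly 0.1 s, 0.2 s, 0.4 s (each up to 10% longer due to jitter), with later attempts capped at 5 s; this is a sketch of the documented behavior, not the library's exact code.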

License

MIT License - see LICENSE for details.
