LLMRateLimiter


Client-side rate limiting for LLM API calls using Redis-backed FIFO queues.

Features

  • FIFO Queue-Based: Fair ordering prevents thundering herd problems
  • Distributed: Redis-backed for multi-process/multi-server deployments
  • Flexible Limits: Supports combined TPM, split input/output TPM, or both
  • Automatic Retry: Exponential backoff with jitter for Redis connection issues
  • Graceful Degradation: Allows requests through on Redis failure

How It Works

flowchart LR
    subgraph Client["Your Application"]
        App[LLM App]
    end

    subgraph RL["LLMRateLimiter"]
        Limiter[RateLimiter]
    end

    subgraph Redis["Redis"]
        Queue[(FIFO Queue<br/>Sorted Set)]
    end

    subgraph LLM["LLM Provider"]
        API[API]
    end

    App -->|1. acquire| Limiter
    Limiter -->|2. Check limits| Queue
    Queue -->|3. Wait time| Limiter
    Limiter -->|4. Return| App
    App -->|5. Call API| API

The rate limiter uses a Redis sorted set to maintain a FIFO queue of requests. Each request records its token consumption, and a server-side Lua script atomically calculates when capacity will become available within the sliding window.
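The capacity math can be sketched with a small in-memory model. This is illustrative only: the real limiter keeps this state in a Redis sorted set and runs the calculation atomically in a Lua script, and the class and method names below are not part of the library's API.

```python
class SlidingWindowModel:
    """Toy in-memory model of the limiter's sliding-window math
    (the library does this atomically in Redis via a Lua script)."""

    def __init__(self, tpm: int, window_seconds: float = 60.0):
        self.tpm = tpm
        self.window = window_seconds
        self.records: list[tuple[float, int]] = []  # (timestamp, tokens)

    def wait_time(self, tokens: int, now: float) -> float:
        # Drop records that have fallen out of the sliding window.
        cutoff = now - self.window
        self.records = [(t, n) for t, n in self.records if t > cutoff]
        used = sum(n for _, n in self.records)
        if used + tokens <= self.tpm:
            return 0.0  # capacity available now
        # Capacity frees up as the oldest records expire; find the
        # earliest time at which enough tokens have left the window.
        need = used + tokens - self.tpm
        freed = 0
        for t, n in self.records:
            freed += n
            if freed >= need:
                return (t + self.window) - now
        return self.window  # worst case: wait a full window

    def record(self, tokens: int, now: float) -> None:
        self.records.append((now, tokens))
```

Because consumers are served in timestamp order, earlier requests always see capacity first, which is what gives the queue its FIFO fairness.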

Installation

pip install llmratelimiter

Or with uv:

uv add llmratelimiter

Quick Start

Basic Usage

from llmratelimiter import RateLimiter

# Just pass a Redis URL and your limits
limiter = RateLimiter("redis://localhost:6379", "gpt-4", tpm=100_000, rpm=100)

# Recommended: specify input and output tokens separately
await limiter.acquire(input_tokens=3000, output_tokens=2000)
response = await openai.chat.completions.create(...)

Split Mode (GCP Vertex AI)

For providers with separate input/output token limits:

limiter = RateLimiter(
    "redis://localhost:6379", "gemini-1.5-pro",
    input_tpm=4_000_000, output_tpm=128_000, rpm=360
)

# Estimate output tokens upfront
result = await limiter.acquire(input_tokens=5000, output_tokens=2048)
response = await vertex_ai.generate(...)

# Adjust after getting actual output
await limiter.adjust(result.record_id, actual_output=response.output_tokens)

AWS Bedrock (Burndown Rate)

AWS Bedrock uses a burndown rate where output tokens count 5x toward TPM:

limiter = RateLimiter(
    "redis://localhost:6379", "claude-sonnet",
    tpm=100_000, rpm=100, burndown_rate=5.0
)

await limiter.acquire(input_tokens=3000, output_tokens=1000)
# TPM consumption: 3000 + (5.0 * 1000) = 8000 tokens
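The weighting reduces to a one-line formula. The helper below is a sketch of that arithmetic; `tpm_consumption` is an illustrative name, not a library function.

```python
def tpm_consumption(input_tokens: int, output_tokens: int,
                    burndown_rate: float = 1.0) -> float:
    """Weighted token count charged against the combined TPM window.
    With the default rate of 1.0 this is just input + output."""
    return input_tokens + burndown_rate * output_tokens
```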

Azure OpenAI (RPS Smoothing)

Azure OpenAI enforces rate limits at sub-second intervals. If you set 600 RPM, Azure actually enforces 10 requests per second. Bursts can trigger 429 errors even when you're under the minute-level limit.

Enable RPS smoothing to prevent burst-triggered rate limits:

# Auto-calculate RPS from RPM (600 RPM = 10 RPS = 100ms minimum gap)
limiter = RateLimiter(
    "redis://localhost:6379", "gpt-4",
    tpm=300_000, rpm=600, smooth_requests=True
)

# Or set explicit RPS for more conservative rate limiting
limiter = RateLimiter(
    "redis://localhost:6379", "gpt-4",
    tpm=300_000, rpm=600, rps=8  # 125ms minimum gap
)

# Custom evaluation interval (Azure may use 1s or 10s intervals)
limiter = RateLimiter(
    "redis://localhost:6379", "gpt-4",
    tpm=300_000, rpm=600, smooth_requests=True, smoothing_interval=10.0
)

With Existing Redis Client

from redis.asyncio import Redis
from llmratelimiter import RateLimiter

redis = Redis(host="localhost", port=6379)
limiter = RateLimiter(redis=redis, model="gpt-4", tpm=100_000, rpm=100)

await limiter.acquire(input_tokens=3000, output_tokens=2000)

With Connection Manager (Production)

For production use with automatic retry and connection pooling:

from llmratelimiter import RateLimiter, RedisConnectionManager, RetryConfig

manager = RedisConnectionManager(
    "redis://localhost:6379",
    retry_config=RetryConfig(max_retries=3, base_delay=0.1),
)
limiter = RateLimiter(manager, "gpt-4", tpm=100_000, rpm=100)

await limiter.acquire(input_tokens=3000, output_tokens=2000)

SSL Connection

Use rediss:// for SSL/TLS connections:

limiter = RateLimiter("rediss://localhost:6379", "gpt-4", tpm=100_000, rpm=100)

Configuration Options

RateLimitConfig

Parameter Description
tpm Combined tokens-per-minute limit
input_tpm Input tokens-per-minute limit
output_tpm Output tokens-per-minute limit
rpm Requests-per-minute limit
window_seconds Sliding window size (default: 60)
burst_multiplier Allow burst above limits (default: 1.0)
burndown_rate Output token multiplier for combined TPM (default: 1.0, AWS Bedrock: 5.0)
smooth_requests Enable RPS smoothing for burst prevention (default: True)
rps Explicit requests-per-second limit (auto-enables smoothing when > 0)
smoothing_interval Evaluation interval for RPS in seconds (default: 1.0)

RetryConfig

Parameter Description
max_retries Maximum retry attempts (default: 3)
base_delay Initial delay in seconds (default: 0.1)
max_delay Maximum delay cap (default: 5.0)
exponential_base Backoff multiplier (default: 2.0)
jitter Random variation 0-1 (default: 0.1)
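Taken together, these parameters describe a standard capped exponential backoff with jitter. The sketch below shows one common shape for that computation; the library's exact jitter formula may differ.

```python
import random

def backoff_delay(attempt: int, base_delay: float = 0.1,
                  exponential_base: float = 2.0, max_delay: float = 5.0,
                  jitter: float = 0.1) -> float:
    """Delay before retry number `attempt` (0-based): exponential
    growth capped at max_delay, plus up to `jitter` fraction extra."""
    delay = min(base_delay * exponential_base ** attempt, max_delay)
    return delay * (1.0 + random.uniform(0.0, jitter))
```

With the defaults, retries wait roughly 0.1 s, 0.2 s, and 0.4 s, never exceeding the 5-second cap.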

License

MIT License - see LICENSE for details.
