# LLMRateLimiter
Client-side rate limiting for LLM API calls using Redis-backed FIFO queues.
- Documentation: https://Ameyanagi.github.io/LLMRateLimiter/
- Repository: https://github.com/Ameyanagi/LLMRateLimiter/
## Features
- FIFO Queue-Based: Fair ordering prevents thundering herd problems
- Distributed: Redis-backed for multi-process/multi-server deployments
- Flexible Limits: Supports combined TPM, split input/output TPM, or both
- Automatic Retry: Exponential backoff with jitter for Redis connection issues
- Graceful Degradation: Fails open, allowing requests through on Redis failure (see the sketch below)
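Failing open means a Redis outage degrades to "no rate limiting" rather than "no requests". A conceptual sketch of what that pattern looks like in general, not the library's actual internals:

```python
import logging

from redis.exceptions import ConnectionError as RedisConnectionError

logger = logging.getLogger(__name__)

async def acquire_fail_open(limiter) -> None:
    """Fail open: if Redis is unreachable after retries, allow the request."""
    try:
        # Normal path: consult the Redis-backed queue and wait if needed.
        await limiter.acquire(input_tokens=1000, output_tokens=500)
    except RedisConnectionError:
        # Degraded path: better to risk a 429 from the provider than to
        # hard-fail every request in the application.
        logger.warning("Redis unavailable; proceeding without rate limiting")
```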
## How It Works
```mermaid
flowchart LR
    subgraph Client["Your Application"]
        App[LLM App]
    end
    subgraph RL["LLMRateLimiter"]
        Limiter[RateLimiter]
    end
    subgraph Redis["Redis"]
        Queue[(FIFO Queue<br/>Sorted Set)]
    end
    subgraph LLM["LLM Provider"]
        API[API]
    end

    App -->|1. acquire| Limiter
    Limiter -->|2. Check limits| Queue
    Queue -->|3. Wait time| Limiter
    Limiter -->|4. Return| App
    App -->|5. Call API| API
```
The rate limiter uses Redis sorted sets to maintain a FIFO queue of requests. Each request records its token consumption, and a Lua script atomically calculates when capacity will become available within the sliding window.
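As a rough illustration of the sliding-window bookkeeping, here is the same idea sketched in Python. The library performs this atomically in a server-side Lua script; the key layout and the `"<id>:<tokens>"` member encoding below are assumptions made for illustration:

```python
import time

from redis.asyncio import Redis

async def tokens_used_in_window(
    redis: Redis, key: str, window_seconds: int = 60
) -> int:
    """Sum the token costs recorded within the sliding window.

    Assumed layout: each request is a sorted-set member "<id>:<tokens>"
    scored by its arrival timestamp.
    """
    now = time.time()
    # Evict entries that have aged out of the window.
    await redis.zremrangebyscore(key, "-inf", now - window_seconds)
    # Sum the token cost of every entry that remains.
    members = await redis.zrange(key, 0, -1)
    return sum(int(m.decode().rsplit(":", 1)[1]) for m in members)

async def has_capacity(redis: Redis, key: str, cost: int, tpm: int) -> bool:
    """Check whether a request costing `cost` tokens fits under the TPM limit."""
    return await tokens_used_in_window(redis, key) + cost <= tpm
```

Running the evict-sum-decide sequence inside Lua makes it atomic, so two concurrent workers cannot both observe spare capacity and oversubscribe it.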
## Installation

```bash
pip install llmratelimiter
```

Or with uv:

```bash
uv add llmratelimiter
```
## Quick Start

### Basic Usage
```python
from llmratelimiter import RateLimiter

# Just pass a Redis URL and your limits
limiter = RateLimiter("redis://localhost:6379", "gpt-4", tpm=100_000, rpm=100)

# Recommended: specify input and output tokens separately
await limiter.acquire(input_tokens=3000, output_tokens=2000)
response = await openai.chat.completions.create(...)
```
### Split Mode (GCP Vertex AI)
For providers with separate input/output token limits:
```python
limiter = RateLimiter(
    "redis://localhost:6379", "gemini-1.5-pro",
    input_tpm=4_000_000, output_tpm=128_000, rpm=360,
)

# Estimate output tokens upfront
result = await limiter.acquire(input_tokens=5000, output_tokens=2048)
response = await vertex_ai.generate(...)

# Adjust after getting the actual output
await limiter.adjust(result.record_id, actual_output=response.output_tokens)
```
### AWS Bedrock (Burndown Rate)
AWS Bedrock uses a burndown rate where output tokens count 5x toward TPM:
```python
limiter = RateLimiter(
    "redis://localhost:6379", "claude-sonnet",
    tpm=100_000, rpm=100, burndown_rate=5.0,
)

await limiter.acquire(input_tokens=3000, output_tokens=1000)
# TPM consumption: 3000 + (5.0 * 1000) = 8000 tokens
```
### With Existing Redis Client
```python
from redis.asyncio import Redis
from llmratelimiter import RateLimiter

redis = Redis(host="localhost", port=6379)
limiter = RateLimiter(redis=redis, model="gpt-4", tpm=100_000, rpm=100)

await limiter.acquire(input_tokens=3000, output_tokens=2000)
```
### With Connection Manager (Production)
For production use with automatic retry and connection pooling:
```python
from llmratelimiter import RateLimiter, RedisConnectionManager, RetryConfig

manager = RedisConnectionManager(
    "redis://localhost:6379",
    retry_config=RetryConfig(max_retries=3, base_delay=0.1),
)
limiter = RateLimiter(manager, "gpt-4", tpm=100_000, rpm=100)

await limiter.acquire(input_tokens=3000, output_tokens=2000)
```
### SSL Connection

Use `rediss://` for SSL/TLS connections:
limiter = RateLimiter("rediss://localhost:6379", "gpt-4", tpm=100_000, rpm=100)
## Configuration Options

### RateLimitConfig
| Parameter | Description |
|---|---|
| `tpm` | Combined tokens-per-minute limit |
| `input_tpm` | Input tokens-per-minute limit |
| `output_tpm` | Output tokens-per-minute limit |
| `rpm` | Requests-per-minute limit |
| `window_seconds` | Sliding window size in seconds (default: 60) |
| `burst_multiplier` | Allows bursting above the nominal limits (default: 1.0) |
| `burndown_rate` | Output-token multiplier for combined TPM (default: 1.0; AWS Bedrock: 5.0) |
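For instance, the less common knobs might be set like this. This is a sketch that assumes `window_seconds` and `burst_multiplier` are accepted as `RateLimiter` keyword arguments, as the table suggests; only `tpm` and `rpm` are confirmed by the examples above:

```python
from llmratelimiter import RateLimiter

# Allow brief bursts up to 150% of the nominal limit over a 30-second window.
limiter = RateLimiter(
    "redis://localhost:6379",
    "gpt-4",
    tpm=100_000,
    rpm=100,
    window_seconds=30,     # assumed kwarg, per the table above
    burst_multiplier=1.5,  # assumed kwarg, per the table above
)
```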
### RetryConfig
| Parameter | Description |
|---|---|
| `max_retries` | Maximum retry attempts (default: 3) |
| `base_delay` | Initial delay in seconds (default: 0.1) |
| `max_delay` | Maximum delay cap in seconds (default: 5.0) |
| `exponential_base` | Backoff multiplier (default: 2.0) |
| `jitter` | Random jitter fraction in [0, 1] (default: 0.1) |
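A fully specified configuration might look like the following. `max_retries` and `base_delay` appear in the production example above; treating the remaining fields as `RetryConfig` keyword arguments is an assumption based on the table:

```python
from llmratelimiter import RateLimiter, RedisConnectionManager, RetryConfig

manager = RedisConnectionManager(
    "redis://localhost:6379",
    retry_config=RetryConfig(
        max_retries=5,
        base_delay=0.2,
        max_delay=5.0,         # assumed field, per the table above
        exponential_base=2.0,  # assumed field, per the table above
        jitter=0.1,            # assumed field, per the table above
    ),
)
limiter = RateLimiter(manager, "gpt-4", tpm=100_000, rpm=100)
```

Assuming the usual exponential-backoff semantics, these values produce retry delays of roughly 0.2 s, 0.4 s, 0.8 s, and so on, capped at 5.0 s, each perturbed by up to 10% random jitter.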
## License
MIT License - see LICENSE for details.
## Download Files
### llmratelimiter-0.2.0.tar.gz (source distribution)

File metadata:

- Size: 88.3 kB
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `45c96aebfbafdb754a16baae7f79e9baa2c92419c39e085d585ab2eb4191ca61` |
| MD5 | `ed714707ab527379e5a3d29c1352ec4d` |
| BLAKE2b-256 | `b73c186be7c11bceed4bf8ba4c62e52beb6b65a2a3252467073b4920dbaeea01` |
Provenance: the following attestation bundles were made for llmratelimiter-0.2.0.tar.gz:

- Publisher: publish.yml on Ameyanagi/LLMRateLimiter
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmratelimiter-0.2.0.tar.gz
- Subject digest: 45c96aebfbafdb754a16baae7f79e9baa2c92419c39e085d585ab2eb4191ca61
- Sigstore transparency entry: 747531564
- Permalink: Ameyanagi/LLMRateLimiter@8505bbafc09a3d6d8a26b44dfb65716848bcdb1b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Ameyanagi
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8505bbafc09a3d6d8a26b44dfb65716848bcdb1b
- Trigger Event: release
### llmratelimiter-0.2.0-py3-none-any.whl (built distribution, Python 3)

File metadata:

- Size: 15.8 kB
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `d17f8b7bd71d406db6811e28cc6a51ab1469740dc2c51f708a3af3fa534e68fb` |
| MD5 | `424bc47dc0d256706a02debdd4111128` |
| BLAKE2b-256 | `26772ed05782022ecc77354431d45429a8fdc7e0de1aa28d39cec2f0a2d5f616` |
Provenance: the following attestation bundles were made for llmratelimiter-0.2.0-py3-none-any.whl:

- Publisher: publish.yml on Ameyanagi/LLMRateLimiter
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmratelimiter-0.2.0-py3-none-any.whl
- Subject digest: d17f8b7bd71d406db6811e28cc6a51ab1469740dc2c51f708a3af3fa534e68fb
- Sigstore transparency entry: 747531565
- Permalink: Ameyanagi/LLMRateLimiter@8505bbafc09a3d6d8a26b44dfb65716848bcdb1b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Ameyanagi
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8505bbafc09a3d6d8a26b44dfb65716848bcdb1b
- Trigger Event: release