Simple Multi-Resource Rate Limiting That Saves Unused Tokens. Rate limit API requests across different resources and workers without wasting your quota. Reserve tokens upfront, get refunds for what you don't use, and avoid over-limiting.

token-throttle

Multi-resource rate limiting for LLM APIs. Reserve tokens before you call, refund what you don't use, stay under the limit across workers.

Works with any LLM provider and any client library — token-throttle limits the rate, not the client.

pip install "token-throttle[redis,tiktoken]>=1.2.0,<1.3.0"   # OpenAI + Redis (recommended)
pip install "token-throttle[redis]>=1.2.0,<1.3.0"            # Any provider + Redis
pip install "token-throttle>=1.2.0,<1.3.0"                   # Any provider + in-memory

Quickstart

OpenAI (built-in helpers)

from openai import AsyncOpenAI
from token_throttle import create_openai_redis_rate_limiter

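client = AsyncOpenAI()
# redis_client is a Redis client created elsewhere (not shown here);
# presumably an async one, since this limiter is awaited.
limiter = create_openai_redis_rate_limiter(
    redis_client, rpm=10_000, tpm=2_000_000,
)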

# 1. Reserve capacity (blocks until available)
request = dict(model="gpt-4.1", messages=[{"role": "user", "content": "Hi"}])
reservation = await limiter.acquire_capacity_for_request(**request, extra_usage=None)

# 2. Make the API call
response = await client.chat.completions.create(**request)

# 3. Refund unused tokens
await limiter.refund_capacity_from_response(reservation, response)

Any provider (manual usage)

from token_throttle import RateLimiter, Quota, UsageQuotas, RedisBackendBuilder
from token_throttle import PerModelConfig

limiter = RateLimiter(
    lambda model: PerModelConfig(
        quotas=UsageQuotas([
            Quota(metric="requests", limit=1_000, per_seconds=60),
            Quota(metric="input_tokens", limit=80_000, per_seconds=60),
            Quota(metric="output_tokens", limit=20_000, per_seconds=60),
        ]),
    ),
    backend=RedisBackendBuilder(redis_client),
)

# Works with Anthropic, Gemini, local models — anything
reservation = await limiter.acquire_capacity(
    model="claude-sonnet-4-20250514",
    usage={"requests": 1, "input_tokens": 500, "output_tokens": 4_000},
)

response = await call_your_llm(...)  # Use whatever client you want

await limiter.refund_capacity(
    actual_usage={"requests": 1, "input_tokens": 480, "output_tokens": 1_200},
    reservation=reservation,
)
# Unused capacity (20 input tokens, 2,800 output tokens) returns to the pool

Why token-throttle

The problem: You're running parallel LLM calls (batch processing, agents, multiple services sharing a key). Simple rate limiters waste throughput because they reserve worst-case tokens and never give them back. You hit 429s or crawl at half capacity.

The solution: Reserve before you call, refund after. Actual usage is tracked, not estimated maximums.
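To make the throughput difference concrete, here is a back-of-the-envelope sketch using the numbers from the manual-usage example above (a 20,000 output-token quota per minute, 4,000 tokens reserved per call, 1,200 actually used). The helper name max_calls_per_window is hypothetical, and the figure with refunds is a steady-state upper bound: it ignores the short period during which each in-flight call still holds its full reservation.

```python
def max_calls_per_window(limit: int, tokens_charged_per_call: int) -> int:
    """Upper bound on calls per window when each call is charged a fixed amount."""
    return limit // tokens_charged_per_call


OUTPUT_TOKEN_LIMIT = 20_000   # quota per 60 s, from the example above
RESERVED_PER_CALL = 4_000     # worst-case estimate reserved upfront
ACTUAL_PER_CALL = 1_200       # what the call really used

# Without refunds, every call is charged its worst-case reservation.
without_refund = max_calls_per_window(OUTPUT_TOKEN_LIMIT, RESERVED_PER_CALL)

# With refunds, only actual usage stays charged once the call completes.
with_refund = max_calls_per_window(OUTPUT_TOKEN_LIMIT, ACTUAL_PER_CALL)

print(without_refund, with_refund)  # 5 16
```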

| Feature | Details |
| --- | --- |
| Multi-resource limits | Limit requests, tokens, and input/output tokens simultaneously, each with its own quota |
| Multiple time windows | e.g., 1,000 req/min AND 10,000 req/day on the same resource |
| Reserve & refund | Reserve max expected usage upfront, refund the difference after the call completes |
| Distributed | Redis backend with atomic locks; safe across workers and processes |
| Per-model quotas | Different limits per model via model_family; the built-in OpenAI helper auto-groups date-suffixed variants (e.g. gpt-4o-20241203 → gpt-4o) |
| Pluggable | Bring your own backend (ships with Redis and in-memory). Sync and async APIs |
| Observability | Callbacks for wait-start, wait-end, consume, refund, and missing-state events |
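The grouping of date-suffixed model names mentioned under per-model quotas can be pictured with a small regex. This is only an illustrative sketch: the function strip_date_suffix is hypothetical and is not the library's openai_model_family_getter, whose actual rules may differ.

```python
import re

# Matches a trailing "-YYYYMMDD" or "-YYYY-MM-DD" suffix (assumed formats).
_DATE_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def strip_date_suffix(model_name: str) -> str:
    """Return the model name without a trailing date suffix, if it has one."""
    return _DATE_SUFFIX.sub("", model_name)


print(strip_date_suffix("gpt-4o-20241203"))   # gpt-4o
print(strip_date_suffix("gpt-4.1"))           # gpt-4.1
```

Models that map to the same family name would then share one set of buckets, so dated variants draw from the same quota.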

How it works

token-throttle implements a token bucket algorithm (capacity refills linearly over time, capped at the quota limit).
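As a rough illustration of linear refill with a cap, here is a minimal single-bucket sketch. It is not the library's implementation: the TokenBucket class and its method names are made up for this example, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    limit: float                        # maximum capacity (the quota limit)
    per_seconds: float                  # window over which `limit` refills
    level: float = field(init=False)    # capacity currently available
    updated_at: float = 0.0             # timestamp of the last refill

    def __post_init__(self) -> None:
        self.level = self.limit         # start with a full bucket

    def refill(self, now: float) -> None:
        """Add capacity linearly for the elapsed time, capped at `limit`."""
        elapsed = max(0.0, now - self.updated_at)
        rate = self.limit / self.per_seconds
        self.level = min(self.limit, self.level + elapsed * rate)
        self.updated_at = now

    def try_consume(self, amount: float, now: float) -> bool:
        """Consume `amount` if available after refilling; otherwise do nothing."""
        self.refill(now)
        if amount > self.level:
            return False
        self.level -= amount
        return True


bucket = TokenBucket(limit=60, per_seconds=60)   # refills 1 unit per second
print(bucket.try_consume(60, now=0.0))   # True: bucket starts full
print(bucket.try_consume(1, now=0.5))    # False: only 0.5 units refilled so far
print(bucket.try_consume(1, now=1.0))    # True: 1 unit available after 1 s
```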

  • Acquire — blocks until enough capacity is available, then atomically reserves it
  • Call — make your API request with any client
  • Refund — report actual usage; unused tokens return to the pool immediately
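The acquire and refund steps above can be sketched for several metrics at once. The functions try_reserve and refund below are hypothetical and operate on plain dictionaries; the sketch assumes all-or-nothing reservation and that usage above the reservation refunds nothing, neither of which is confirmed by the source.

```python
def try_reserve(available: dict, usage: dict) -> bool:
    """Deduct `usage` from every metric, or from none if any metric is short."""
    if any(available[metric] < amount for metric, amount in usage.items()):
        return False
    for metric, amount in usage.items():
        available[metric] -= amount
    return True


def refund(available: dict, reserved: dict, actual: dict, limits: dict) -> None:
    """Return the unused part of a reservation, never exceeding each limit."""
    for metric, amount in reserved.items():
        unused = max(0, amount - actual.get(metric, 0))
        available[metric] = min(limits[metric], available[metric] + unused)


limits = {"requests": 1_000, "input_tokens": 80_000, "output_tokens": 20_000}
available = dict(limits)
reserved = {"requests": 1, "input_tokens": 500, "output_tokens": 4_000}
actual = {"requests": 1, "input_tokens": 480, "output_tokens": 1_200}

try_reserve(available, reserved)
refund(available, reserved, actual, limits)
print(available)
# {'requests': 999, 'input_tokens': 79520, 'output_tokens': 18800}
```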

The Redis backend uses sorted locking to prevent deadlocks when acquiring multiple resource buckets simultaneously.
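The idea behind sorted locking is that every worker takes its locks in the same global order, so two workers can never each hold a lock the other is waiting for. The sketch below shows the technique with in-process threading locks standing in for Redis locks; bucket_locks and run_with_buckets_locked are names invented for this example.

```python
import threading
from contextlib import ExitStack

# One lock per resource bucket (in-process stand-ins for distributed locks).
bucket_locks = {
    name: threading.Lock()
    for name in ("requests", "input_tokens", "output_tokens")
}


def run_with_buckets_locked(names, action):
    """Acquire the named locks in sorted order, run `action`, then release all."""
    with ExitStack() as stack:
        for name in sorted(set(names)):       # same order for every caller
            stack.enter_context(bucket_locks[name])
        return action()


counter = {"n": 0}


def bump():
    counter["n"] += 1


def worker(names):
    for _ in range(1_000):
        run_with_buckets_locked(names, bump)


# The two workers name the buckets in opposite orders; sorting removes the
# lock-order inversion that could otherwise deadlock them.
t1 = threading.Thread(target=worker, args=(["requests", "input_tokens"],))
t2 = threading.Thread(target=worker, args=(["input_tokens", "requests"],))
t1.start(); t2.start()
t1.join(timeout=10); t2.join(timeout=10)
print(counter["n"])  # 2000
```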

Configuration

Quotas

from token_throttle import Quota, UsageQuotas, SecondsIn

quotas = UsageQuotas([
    Quota(metric="requests", limit=2_000, per_seconds=SecondsIn.MINUTE),
    Quota(metric="tokens", limit=3_000_000, per_seconds=SecondsIn.MINUTE),
    Quota(metric="requests", limit=10_000_000, per_seconds=SecondsIn.DAY),
])

per_seconds accepts integer seconds. Use SecondsIn.MINUTE (60), SecondsIn.HOUR (3600), SecondsIn.DAY (86400), or any integer.
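When the same metric has several windows, a request has to fit in all of them, so the slowest-refilling window decides how long a caller waits. The sketch below estimates that wait under the linear-refill model described in "How it works". The Window type and seconds_until_available function are hypothetical and not part of the library.

```python
from typing import NamedTuple


class Window(NamedTuple):
    limit: float        # quota limit for this window
    per_seconds: float  # window length in seconds
    available: float    # capacity currently left in this window's bucket


def seconds_until_available(amount: float, windows: list) -> float:
    """Time until every window can cover `amount`, assuming linear refill."""
    waits = []
    for w in windows:
        deficit = max(0.0, amount - w.available)
        # Refill rate is limit / per_seconds, so wait = deficit / rate.
        waits.append(deficit * w.per_seconds / w.limit)
    return max(waits, default=0.0)


# Both buckets empty: 1,000 req/min and 10,000 req/day on the same metric.
windows = [Window(1_000, 60, 0.0), Window(10_000, 86_400, 0.0)]
print(seconds_until_available(1, windows))  # the daily window dominates
```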

Per-model configuration

def get_config(model_name: str) -> PerModelConfig:
    if model_name.startswith("gpt"):
        return PerModelConfig(
            quotas=UsageQuotas([
                Quota(metric="requests", limit=10_000, per_seconds=60),
                Quota(metric="tokens", limit=2_000_000, per_seconds=60),
            ]),
            usage_counter=OpenAIUsageCounter(),  # auto-counts tokens from messages
            model_family=openai_model_family_getter(model_name),
        )
    # ... other providers

limiter = RateLimiter(get_config, backend=RedisBackendBuilder(redis_client))

Backends

# Distributed (multiple workers/processes)
from token_throttle import RedisBackendBuilder
backend = RedisBackendBuilder(redis_client)

# Single process (no Redis needed)
from token_throttle import MemoryBackendBuilder
backend = MemoryBackendBuilder()

Both backends are available in sync (SyncRedisBackendBuilder, SyncMemoryBackendBuilder) and async variants.

Dynamic rate limits

Adjust bucket limits at runtime without rebuilding the limiter — useful for adaptive rate limiting (e.g., reacting to x-ratelimit-* response headers):

# After at least one acquire/record call for this model:
await limiter.set_max_capacity(
    model="gpt-4o",
    metric="tokens",
    per_seconds=60,
    value=5000,
)

For Redis backends the new limit is written to Redis, so all processes sharing the same Redis see the change within ~1 second.
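One way to feed set_max_capacity is to read the limit a provider reports in its response headers. The helper below is a sketch under assumptions: limit_from_headers is a hypothetical function, and the default header name x-ratelimit-limit-tokens follows OpenAI's naming; other providers use different headers.

```python
from typing import Mapping, Optional


def limit_from_headers(
    headers: Mapping[str, str],
    name: str = "x-ratelimit-limit-tokens",
) -> Optional[int]:
    """Return the advertised limit as a positive int, or None if unusable."""
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get(name.lower())
    if raw is None:
        return None
    try:
        value = int(raw)
    except ValueError:
        return None
    return value if value > 0 else None


print(limit_from_headers({"X-RateLimit-Limit-Tokens": "150000"}))  # 150000
print(limit_from_headers({}))                                      # None
```

A non-None result could then be passed as value to set_max_capacity for the matching metric and window.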

Timeout

By default, acquire_capacity blocks until enough capacity is available. Use timeout to fail fast or cap the wait:

# Non-blocking: try to reserve capacity without waiting
try:
    reservation = await limiter.acquire_capacity(
        model="gpt-4o",
        usage={"requests": 1, "tokens": 500},
        timeout=0,  # Fail immediately if no capacity
    )
except TimeoutError:
    # Handle: retry later, use cheaper model, skip, etc.
    pass

# Bounded wait: wait up to 5 seconds
reservation = await limiter.acquire_capacity(
    model="gpt-4o",
    usage={"requests": 1, "tokens": 500},
    timeout=5.0,  # Raise TimeoutError after 5s
)
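A common use of timeout=0 is falling back to another model when the preferred one has no capacity. The sketch below shows that pattern against a stand-in limiter so it runs on its own: FakeLimiter and acquire_with_fallback are hypothetical, and only the acquire_capacity(model=..., usage=..., timeout=...) call shape and the TimeoutError are taken from the examples above.

```python
import asyncio


class FakeLimiter:
    """Stand-in limiter: models listed in `exhausted` never have capacity."""

    def __init__(self, exhausted):
        self.exhausted = set(exhausted)

    async def acquire_capacity(self, model, usage, timeout=None):
        if model in self.exhausted:
            raise TimeoutError(f"no capacity for {model}")
        return {"model": model, "usage": usage}


async def acquire_with_fallback(limiter, models, usage):
    """Try each model in order without waiting; return the first that fits."""
    for model in models:
        try:
            reservation = await limiter.acquire_capacity(
                model=model, usage=usage, timeout=0
            )
        except TimeoutError:
            continue  # this model is at its limit, try the next one
        return model, reservation
    raise TimeoutError("no model has capacity right now")


limiter = FakeLimiter(exhausted={"gpt-4o"})
model, reservation = asyncio.run(
    acquire_with_fallback(
        limiter, ["gpt-4o", "gpt-4o-mini"], {"requests": 1, "tokens": 500}
    )
)
print(model)  # gpt-4o-mini
```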

Sync API

from token_throttle import SyncRateLimiter, SyncMemoryBackendBuilder

limiter = SyncRateLimiter(get_config, backend=SyncMemoryBackendBuilder())

reservation = limiter.acquire_capacity(model="gpt-4.1", usage={"requests": 1, "tokens": 500})
response = call_llm_sync(...)
limiter.refund_capacity(actual_usage={"requests": 1, "tokens": 320}, reservation=reservation)
