maxllm_gate - Intelligent LLM client with built-in rate limiting. Maximizes throughput and prevents 429 errors.

maxllm_gate

Production-ready intelligent LLM client with built-in rate limiting, smart routing, and distributed state support. Published on PyPI as maxllm_gate.

Overview

maxllm_gate is a production-ready LLM client that automatically manages rate limits across multiple API keys and providers. It works on top of LiteLLM as an intelligent scheduling and optimization layer.

Install from PyPI with `pip install maxllm_gate`. Import in Python with `from maxllm_gate import ...`.

from maxllm_gate import maxllm_gate
import asyncio

async def main():
    # Load from environment variables
    async with maxllm_gate.from_env() as client:
        # Use like OpenAI client - rate limiting is automatic
        response = await client.chat("gpt-4o-mini", "Explain quantum computing")
        print(response.content)

asyncio.run(main())

What it does automatically:

✅ Multi-key management - Manages multiple API keys across providers (Groq, OpenAI, OpenRouter, etc.)
✅ Real-time rate limiting - Tracks TPM/RPM limits per key and per model with token bucket algorithm
✅ Smart routing - SDK supports least_utilized, round_robin, latency_aware, and balanced; the FastAPI gateway supports least_utilized, round_robin, and token_aware
✅ No 429 errors - Defers requests when capacity exhausted instead of failing
✅ Auto-retry - Exponential backoff on transient failures
✅ Streaming support - Async streaming with proper token tracking
✅ Input validation - Pydantic models validate all inputs
✅ Graceful shutdown - Context manager support with proper cleanup
✅ Production-ready - Optional Redis backend for distributed state
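
The token bucket idea behind the per-key TPM/RPM tracking can be sketched in a few lines. This is a minimal illustration, not maxllm_gate's implementation: the `TokenBucket` class, its method names, and the continuous per-second refill are assumptions made for the example.

```python
import time


class TokenBucket:
    """Hypothetical bucket that refills continuously up to `capacity` units per minute."""

    def __init__(self, capacity, now=None):
        self.capacity = float(capacity)
        self.available = float(capacity)
        self.refill_per_second = capacity / 60.0
        self.updated_at = time.monotonic() if now is None else now

    def _refill(self, now):
        # Credit the units earned since the last update, never exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_second)
        self.updated_at = now

    def try_consume(self, amount, now=None):
        """Take `amount` units if they are available; otherwise consume nothing."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        if amount > self.available:
            return False
        self.available -= amount
        return True

    def seconds_until(self, amount, now=None):
        """Seconds to wait before `amount` units will be available."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        missing = amount - self.available
        return max(0.0, missing / self.refill_per_second)


bucket = TokenBucket(capacity=30000, now=0.0)  # e.g. a 30,000 TPM limit
print(bucket.try_consume(29000, now=0.0))   # True: enough tokens left
print(bucket.try_consume(2000, now=0.0))    # False: only 1,000 remain
print(bucket.seconds_until(2000, now=0.0))  # 2.0: 500 tokens refill per second
```

A key with both limits would hold two such buckets, one sized to `tpm_limit` and one to `rpm_limit`, and a request would proceed only when both can be consumed.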

Installation

# Base installation
pip install maxllm_gate

# With Redis backend (for production/distributed deployments)
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "maxllm_gate[redis]"

# With server mode (optional FastAPI gateway)
pip install "maxllm_gate[server]"

# Everything (recommended for production)
pip install "maxllm_gate[all]"

from maxllm_gate import maxllm_gate

Quick Start

1. Create `.env`

cp .env.example .env

Set API_KEYS_CONFIG in .env:

API_KEYS_CONFIG='{
  "groq-1": {
    "api_key": "gsk_your_groq_key",
    "provider": "groq",
    "models": {
      "llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30},
      "mixtral-8x7b-32768": {"tpm_limit": 15000, "rpm_limit": 20}
    }
  },
  "openai-1": {
    "api_key": "sk-your_openai_key",
    "provider": "openai",
    "models": {
      "gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500},
      "gpt-4o": {"tpm_limit": 30000, "rpm_limit": 200}
    }
  }
}'

DEFAULT_STRATEGY=least_utilized

2. Use it (Async-only)

from maxllm_gate import maxllm_gate
import asyncio

async def main():
    # Context manager ensures graceful shutdown
    async with maxllm_gate.from_env() as client:
        # Simple chat
        response = await client.chat("gpt-4o-mini", "Hello!")
        print(response.content)
        
        # With messages list
        response = await client.chat("mixtral-8x7b-32768", [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a haiku about Python."},
        ])
        
        # Streaming
        async for chunk in client.chat_stream("gpt-4o-mini", "Tell me a story"):
            print(chunk, end="", flush=True)
        
        # Check capacity and scores
        print(client.capacity())
        print(client.scores())  # See routing decisions

asyncio.run(main())

Async Usage (Native)

from maxllm_gate import maxllm_gate
import asyncio

async def main():
    # Async context manager
    async with maxllm_gate.from_env() as client:
        # Single request
        response = await client.chat("gpt-4o-mini", "Hello!")
        print(response.content)
        
        # Concurrent requests - automatically load balanced
        tasks = [
            client.chat("gpt-4o-mini", f"Question {i}")
            for i in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        
        # Async streaming
        async for chunk in client.chat_stream("gpt-4o-mini", "Tell a story"):
            print(chunk, end="", flush=True)

asyncio.run(main())

Configuration

Environment Variables

maxllm_gate uses environment variables as the shared configuration source for both the SDK and the FastAPI gateway.

API_KEYS_CONFIG is required, and each configured model must declare its own tpm_limit and rpm_limit. Server settings such as HOST, PORT, DEBUG, and LOG_LEVEL are optional. BASE_URL is only needed by the HTTP example scripts and other external integrations that call the gateway over HTTP.

HOST=0.0.0.0
PORT=8000
DEBUG=false
LOG_LEVEL=INFO

API_KEYS_CONFIG='{
  "groq-1": {"api_key": "gsk_key_1", "provider": "groq", "models": {"llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30}}},
  "groq-2": {"api_key": "gsk_key_2", "provider": "groq", "models": {"llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30}}}
}'
DEFAULT_STRATEGY=least_utilized
DEFAULT_MAX_TOKENS=1024
DEFAULT_TEMPERATURE=0.7
TOKEN_ESTIMATION_BUFFER=1.1
MAX_RETRIES=3
RETRY_BASE_DELAY=1.0
RETRY_MAX_DELAY=60.0
MAX_QUEUE_SIZE=10000
DEFAULT_PRIORITY=medium
REDIS_URL=redis://localhost:6379
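
How `MAX_RETRIES`, `RETRY_BASE_DELAY`, and `RETRY_MAX_DELAY` combine is not spelled out here. A common exponential-backoff schedule, assumed for this sketch, doubles the delay after each attempt and caps it at the maximum. The function name and the optional jitter parameter are hypothetical.

```python
import random


def backoff_delays(max_retries=3, base_delay=1.0, max_delay=60.0, jitter=0.0, rng=None):
    """Return the wait in seconds before each retry: base * 2**attempt, capped at max_delay."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(max_delay, base_delay * (2 ** attempt))
        # Optional jitter (a fraction of the delay) keeps clients from retrying in lockstep.
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays())  # [1.0, 2.0, 4.0]
```

With the values shown above, three retries would wait roughly 1, 2, and 4 seconds; the 60-second cap only matters once the retry count reaches seven or more.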

Routing Strategies

The SDK and the FastAPI gateway do not expose the exact same routing strategies.

Strategy	Available In	Best For	How It Works
`balanced`	SDK	Production client usage	Combines utilization, latency, recent errors, and freshness into a weighted score.
`latency_aware`	SDK	Low latency	Prefers keys with the fastest observed response times.
`least_utilized`	SDK and gateway	Safe default	Routes to the key with the most available TPM/RPM headroom.
`round_robin`	SDK and gateway	Fair distribution	Cycles through keys evenly.
`token_aware`	Gateway	Server-side scheduling	Prefers keys with enough token capacity for the request.

Recommended: Use least_utilized in .env if you want one strategy setting that works for both the SDK and the gateway. Use balanced when working directly with the SDK and you want the richer scoring model.
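
The two strategies shared by the SDK and the gateway are simple enough to sketch. The usage snapshot, its field names, and the definition of utilization as the tighter of the TPM and RPM fractions are assumptions for illustration, not the library's internals.

```python
from itertools import cycle

# Hypothetical per-key usage snapshot; field names are made up for this sketch.
KEYS = {
    "groq-1": {"tokens_used": 24000, "tpm_limit": 30000, "requests_used": 12, "rpm_limit": 30},
    "groq-2": {"tokens_used": 6000, "tpm_limit": 30000, "requests_used": 21, "rpm_limit": 30},
    "openai-1": {"tokens_used": 9000, "tpm_limit": 90000, "requests_used": 50, "rpm_limit": 500},
}


def utilization(usage):
    """Fraction of the tighter limit (TPM or RPM) already used."""
    return max(
        usage["tokens_used"] / usage["tpm_limit"],
        usage["requests_used"] / usage["rpm_limit"],
    )


def least_utilized(keys):
    """Pick the key id with the most headroom left."""
    return min(keys, key=lambda key_id: utilization(keys[key_id]))


def round_robin(keys):
    """Cycle through key ids in a stable order."""
    return cycle(sorted(keys))


print(least_utilized(KEYS))  # openai-1
rotation = round_robin(KEYS)
print([next(rotation) for _ in range(4)])  # ['groq-1', 'groq-2', 'openai-1', 'groq-1']
```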

# See routing decisions in real-time
scores = client.scores()
print(scores)
# {
#   "groq-1": {
#     "total_score": 0.23,      # Lower = better
#     "utilization": 0.15,      # 15% capacity used
#     "latency_normalized": 0.08,
#     "latency_avg_ms": 245.5,
#     "error_penalty": 0.0,     # No recent errors
#     "freshness": 0.85
#   },
#   ...
# }
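
One way to read the fields above is as inputs to a weighted sum where the lowest total wins. The sketch below shows that idea only; the weights, the way freshness discounts the latency term, and the function names are illustrative assumptions, so its numbers will not match the sample output.

```python
def balanced_score(utilization, latency_normalized, error_penalty, freshness,
                   weights=(0.5, 0.3, 0.2)):
    """Hypothetical weighted score; lower is better."""
    w_util, w_latency, w_error = weights
    # Discount the latency term when the measurement is stale (freshness near 0).
    return (w_util * utilization
            + w_latency * latency_normalized * freshness
            + w_error * error_penalty)


def pick_key(metrics):
    """Return the key id with the lowest score."""
    return min(metrics, key=lambda key_id: balanced_score(**metrics[key_id]))


metrics = {
    "groq-1": dict(utilization=0.15, latency_normalized=0.08, error_penalty=0.0, freshness=0.85),
    "groq-2": dict(utilization=0.60, latency_normalized=0.05, error_penalty=0.5, freshness=0.90),
}
print(pick_key(metrics))  # groq-1
```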

Environment Variables

export API_KEYS_CONFIG='{
  "groq-1": {"api_key": "gsk_...", "provider": "groq", "models": {"mixtral-8x7b-32768": {"tpm_limit": 30000, "rpm_limit": 30}}}
}'

# Then in Python
from maxllm_gate import maxllm_gate
client = maxllm_gate.from_env()
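
Since every configured model must declare both `tpm_limit` and `rpm_limit`, it can help to check the JSON before handing it to the client. The helper below is a hypothetical pre-flight check written with the standard library only; it is not part of maxllm_gate.

```python
import json


def load_keys_config(raw):
    """Parse an API_KEYS_CONFIG string and verify each model declares both limits."""
    config = json.loads(raw)
    for key_id, entry in config.items():
        for field in ("api_key", "provider", "models"):
            if field not in entry:
                raise ValueError(f"{key_id}: missing '{field}'")
        for model, limits in entry["models"].items():
            for limit in ("tpm_limit", "rpm_limit"):
                value = limits.get(limit)
                # bool is a subclass of int, so reject it explicitly.
                if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                    raise ValueError(f"{key_id}/{model}: '{limit}' must be a positive integer")
    return config


raw = '{"groq-1": {"api_key": "gsk_...", "provider": "groq", "models": {"mixtral-8x7b-32768": {"tpm_limit": 30000, "rpm_limit": 30}}}}'
print(sorted(load_keys_config(raw)))  # ['groq-1']
```

In practice the string would come from `os.environ["API_KEYS_CONFIG"]`.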

FastAPI Gateway

Install the gateway extras and start the HTTP server:

pip install "maxllm_gate[server]"
maxllm_gate-server

The gateway reads HOST, PORT, DEBUG, and LOG_LEVEL from .env. Set BASE_URL for any example scripts or integrations that call the gateway over HTTP.

Supported Providers

maxllm_gate works with any provider supported by LiteLLM:

Provider	Config Name	Example Models
OpenAI	`openai`	`gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`
Groq	`groq`	`llama-3.1-70b-versatile`, `mixtral-8x7b-32768`
OpenRouter	`openrouter`	`anthropic/claude-3-haiku`, `meta-llama/llama-3-70b`
Anthropic	`anthropic`	`claude-3-haiku-20240307`, `claude-3-5-sonnet-20241022`
Together AI	`together_ai`	`mistralai/Mixtral-8x7B-Instruct-v0.1`
Anyscale	`anyscale`	`meta-llama/Llama-3-70b-chat-hf`
Fireworks	`fireworks_ai`	`accounts/fireworks/models/llama-v3-70b-instruct`
NVIDIA NIM	`nvidia_nim`	Any NVIDIA NIM endpoint
Azure OpenAI	`azure`	Your Azure deployments

Provider Configuration

API_KEYS_CONFIG='{
  "openai-1": {"api_key": "sk-...", "provider": "openai", "models": {"gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500}}},
  "groq-1": {"api_key": "gsk_...", "provider": "groq", "models": {"llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30}}},
  "openrouter-1": {"api_key": "sk-or-...", "provider": "openrouter", "models": {"anthropic/claude-3-haiku": {"tpm_limit": 100000, "rpm_limit": 200}}}
}'

How It Works

┌─────────────────────────────────────────────────────────────────┐
│                         Your Code                                │
│   response = await client.chat("gpt-4o-mini", "Hello!")         │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       maxllm_gate Client                        │
│  1. Validate inputs (Pydantic)                                   │
│  2. Estimate tokens needed (~50 tokens)                          │
│  3. Select best key using routing strategy                       │
│  4. Check capacity - defer if needed                             │
│  5. Execute via LiteLLM                                          │
│  6. Record latency & update rate limits                          │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          LiteLLM                                 │
│                    (handles provider API)                        │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
                         ┌──────────────┐
                         │   OpenAI     │
                         └──────────────┘
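
Step 2 reserves capacity before the request is sent, so the estimate has to be made without calling the provider. The project lists tiktoken for token estimation; the sketch below swaps in a rough characters-per-token heuristic so it runs with the standard library alone. The function name, the 4-characters-per-token ratio, and the per-message overhead are assumptions, and the `buffer` argument mirrors `TOKEN_ESTIMATION_BUFFER=1.1`.

```python
import math


def estimate_tokens(messages, buffer=1.1, chars_per_token=4, overhead_per_message=4):
    """Rough prompt-token estimate, padded by a safety buffer."""
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
    raw = sum(
        math.ceil(len(message["content"]) / chars_per_token) + overhead_per_message
        for message in messages
    )
    # Round up so the reservation never undershoots the padded estimate.
    return math.ceil(raw * buffer)


print(estimate_tokens("Explain quantum computing"))  # 13
```

Overestimating slightly is the safer error here: an estimate that is too low could let a request through that pushes a key past its TPM limit.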

Key Features

1. Deferred Execution (No 429 Errors)

When ALL keys are at capacity, maxllm_gate doesn't fail - it waits:

# If all keys exhausted, request is automatically deferred
# until capacity is available (no 429 errors!)
response = await client.chat("gpt-4o-mini", "Hello")  # May wait, then succeeds
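
The wait-instead-of-fail behavior can be illustrated with a small asyncio gate. For brevity this sketch uses a fixed-window counter rather than a token bucket, and the `WindowGate` class is hypothetical; it only shows how a caller can be held until capacity returns instead of receiving an error.

```python
import asyncio
import time


class WindowGate:
    """Admit at most `limit` requests per fixed window; later callers wait."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.used = 0
        self.window_end = 0.0  # forces a fresh window on first use

    async def acquire(self):
        """Return True if the caller had to wait for capacity, False otherwise."""
        waited = False
        while True:
            now = time.monotonic()
            if now >= self.window_end:
                # A new window starts: reset the counter.
                self.used = 0
                self.window_end = now + self.window_seconds
            if self.used < self.limit:
                self.used += 1
                return waited
            waited = True
            # Sleep until the current window closes, then re-check.
            await asyncio.sleep(self.window_end - now)


async def demo():
    gate = WindowGate(limit=2, window_seconds=0.05)
    return await asyncio.gather(*(gate.acquire() for _ in range(5)))


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [False, False, True, True, True]
```

All five calls complete; the last three are simply delayed until a later window has room.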

2. Input Validation

All inputs are validated with Pydantic before execution:

from maxllm_gate.validation import validate_chat_request

# Manual validation
request = validate_chat_request(
    model="gpt-4",
    messages="Hello!",
    temperature=0.7,
)

# Automatic validation (default)
response = await client.chat("gpt-4", "Hello!", validate=True)  # ✅ Validated

# Skip validation for performance (not recommended)
response = await client.chat("gpt-4", "Hello!", validate=False)

Validation checks:

✅ Model name is valid (no special characters)
✅ Messages are not empty or whitespace-only
✅ Temperature is 0-2
✅ Max tokens is positive
✅ Priority is high/medium/low
✅ Roles are valid (system/user/assistant/function/tool)
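
The checks listed above can be mirrored in plain Python to see what each one rejects. This is a standard-library sketch, not the library's Pydantic models: the function name is hypothetical, and the exact rule for "no special characters" is an assumption (letters, digits, and `. _ : / -`).

```python
import re

VALID_ROLES = {"system", "user", "assistant", "function", "tool"}
VALID_PRIORITIES = {"high", "medium", "low"}
# Assumed rule for "no special characters": letters, digits, and . _ : / - only.
MODEL_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._:/-]*")


def check_chat_request(model, messages, temperature=0.7, max_tokens=1024, priority="medium"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if isinstance(messages, str):
        # A bare string is treated as a single user message.
        messages = [{"role": "user", "content": messages}]
    if not MODEL_NAME.fullmatch(model):
        problems.append("model name contains unsupported characters")
    if not messages:
        problems.append("messages must not be empty")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            problems.append(f"invalid role: {message.get('role')!r}")
        if not str(message.get("content", "")).strip():
            problems.append("message content is empty or whitespace-only")
    if not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")
    if max_tokens <= 0:
        problems.append("max_tokens must be positive")
    if priority not in VALID_PRIORITIES:
        problems.append("priority must be high, medium, or low")
    return problems


print(check_chat_request("gpt-4o-mini", "Hello!"))  # []
```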

3. Graceful Shutdown

Use context managers for automatic cleanup:

# Async (only mode supported)
async with maxllm_gate.from_env() as client:
    response = await client.chat(...)
# Waits for in-flight requests, then shuts down

Or manual shutdown:

import asyncio
from maxllm_gate import maxllm_gate

async def main():
    client = maxllm_gate.from_env()
    try:
        response = await client.chat(...)
    finally:
        await client.shutdown(timeout=30)  # Wait max 30s for pending requests

asyncio.run(main())
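
What "wait for in-flight requests, up to a timeout" means can be shown with plain asyncio. The `drain` helper below is a hypothetical sketch of the pattern, not the body of `shutdown()`: it waits for tasks to finish, cancels whatever is still running at the deadline, and reports both counts.

```python
import asyncio


async def drain(tasks, timeout):
    """Wait up to `timeout` seconds, cancel stragglers, return (finished, cancelled)."""
    if not tasks:
        return 0, 0
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind; swallow their CancelledError.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def demo():
    fast = asyncio.create_task(asyncio.sleep(0.01))
    slow = asyncio.create_task(asyncio.sleep(10))
    return await drain({fast, slow}, timeout=0.2)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (1, 1)
```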

Production Deployment

Redis Backend (Recommended for Production)

For distributed deployments or to persist rate limit state across restarts, use Redis:

pip install "maxllm_gate[redis]"

REDIS_URL=redis://localhost:6379
API_KEYS_CONFIG='{"openai-1": {"api_key": "sk-...", "provider": "openai", "models": {"gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500}}}}'

Redis provides:

🔄 Persistent state - Rate limits survive restarts
🌐 Distributed coordination - Multiple instances share state
📊 Centralized metrics - Latency tracking across all instances
🔒 Distributed locks - Atomic operations across workers

Using HybridRateLimiter (auto-fallback):

from maxllm_gate.redis_backend import HybridRateLimiter
import asyncio

async def main():
    # Tries Redis, falls back to in-memory if unavailable
    limiter = HybridRateLimiter(
        redis_url="redis://localhost:6379",
        fallback_to_memory=True,
    )

    # initialize() must be awaited, so it runs inside an async function
    await limiter.initialize()

    if limiter.is_distributed:
        print("✅ Using Redis backend")
    else:
        print("⚠️ Fallback to in-memory (Redis unavailable)")

asyncio.run(main())

Docker Deployment

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install maxllm_gate[all]

COPY .env .
COPY app.py .

CMD ["python", "app.py"]

# docker-compose.yml
version: '3.8'
services:
  maxllm_gate:
    build: .
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
  
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:

Monitoring & Observability

from maxllm_gate import maxllm_gate

client = maxllm_gate.from_env()

# Check capacity across all keys
capacity = client.capacity()
print(f"Total capacity: {capacity['total_capacity']}")

# View latency stats per key
latency = client.latency()
for key_id, stats in latency.items():
    print(f"{key_id}: avg={stats['avg_ms']:.1f}ms, p99={stats['p99_ms']:.1f}ms")

# Debug routing decisions
scores = client.scores()
for key_id, score_data in scores.items():
    print(f"{key_id}: score={score_data['total_score']:.2f}, "
          f"util={score_data['utilization']:.2f}, "
          f"latency={score_data['latency_avg_ms']:.1f}ms")
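
For reference, the `avg_ms` and `p99_ms` figures are the kind of summary that can be computed from raw per-request timings. The helper below is a hypothetical sketch using the nearest-rank percentile method; how maxllm_gate actually aggregates its samples is not documented here.

```python
def latency_stats(samples_ms):
    """Return the mean and nearest-rank 99th percentile of latency samples."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank p99: ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "avg_ms": sum(ordered) / len(ordered),
        "p99_ms": ordered[rank - 1],
    }


print(latency_stats(list(range(1, 101))))  # {'avg_ms': 50.5, 'p99_ms': 99}
```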

Health Checks

# For Kubernetes/Docker health checks
def health_check():
    try:
        capacity = client.capacity()
        # Check if any key has capacity
        has_capacity = any(
            key['tokens_remaining'] > 1000 
            for key in capacity['keys'].values()
        )
        return has_capacity
    except Exception:
        return False

API Reference

ChatResponse

response = await client.chat("gpt-4o-mini", "Hello")

response.content       # The generated text
response.model         # Model used
response.usage         # {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30}
response.finish_reason # "stop", "length", etc.
response.latency       # Total request time in seconds
response.llm_latency   # LLM provider time only (NEW)
response.key_used      # Which API key was used

maxllm_gate Methods

Method	Description	Returns
`chat(model, messages, **kwargs)`	Async chat completion	`ChatResponse`
`chat_stream(model, messages, **kwargs)`	Async streaming completion	`AsyncGenerator[str]`
`add_key(api_key, provider, models, ...)`	Add key at runtime	`None`
`status()`	Get scheduler status	`dict`
`capacity()`	Get current capacity	`dict`
`latency()`	Get latency stats per key (NEW)	`dict`
`scores()`	Get routing scores (NEW)	`dict`
`shutdown(timeout)`	Graceful shutdown (NEW)	`None`

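`chat`, `chat_stream`, and `shutdown` are async: await `chat` and `shutdown`, and iterate `chat_stream` with `async for`. The examples in this document call the inspection helpers `capacity()`, `latency()`, and `scores()` without `await`.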

Configuration Classes

from maxllm_gate.config import maxllm_gate_config, KeyConfig

# Programmatic config
config = maxllm_gate_config(
    keys=[
        KeyConfig(
            api_key="sk-...",
            provider="openai",
            models={
                "gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500},
            },
        )
    ],
    strategy="balanced",
    max_retries=3,
)

client = maxllm_gate(config=config)

Validation

from maxllm_gate.validation import validate_chat_request, ChatRequest, ChatMessage

# Validate before sending
request = validate_chat_request(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    temperature=0.7,
    max_tokens=1024,
)

# Access validated data
print(request.model)  # "gpt-4"
print(request.messages[0].role)  # "user"

Testing

# Install dev dependencies
pip install "maxllm_gate[dev]"

# Run all tests
pytest

# Run specific test file
pytest tests/test_sdk.py

# With coverage report
pytest --cov=maxllm_gate --cov-report=html

# Run only SDK tests (fast, no server deps needed)
pytest tests/test_sdk.py -v

Test Structure

tests/
├── conftest.py             # Shared fixtures
├── test_api.py             # FastAPI route tests
├── test_scheduler.py       # Gateway scheduler tests
├── test_sdk.py             # SDK client tests
├── test_strategies.py      # Gateway routing strategy tests
├── test_token_bucket.py    # Rate limiting bucket tests
└── test_token_estimator.py # Token estimation tests

Examples

See the examples/ directory for more:

basic_usage.py - Gateway HTTP example
simple_async.py - Minimal async SDK usage
concurrent_requests.py - Concurrent request handling
multi_key_config.py - Multiple keys and providers
priority_requests.py - Gateway request priorities

Architecture

maxllm_gate is built as a scheduling layer on top of LiteLLM:

┌──────────────────────────────────────────┐
│            Your Application              │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│          maxllm_gate (Scheduler)         │
│  • Rate limiting (token bucket)          │
│  • Smart routing (SDK and gateway)       │
│  • Queue management                      │
│  • Latency tracking                      │
│  • Input validation                      │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│        LiteLLM (Execution)               │
│  • Provider abstraction                  │
│  • API key management                    │
│  • Retry logic                           │
└────────────────┬─────────────────────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
  ┌─────┐   ┌─────┐     ┌─────┐
  │ GPT │   │Groq │     │ ... │
  └─────┘   └─────┘     └─────┘

Contributing

Contributions welcome! Please:

Fork the repo
Create a feature branch
Add tests for new features
Ensure all tests pass: pytest
Submit a pull request

Roadmap

Cost tracking and optimization
Streaming with backpressure control
Web dashboard UI
Batch request helpers
Provider-specific config presets
Custom retry strategies

FAQ

Q: Why use maxllm_gate instead of calling LiteLLM directly?

A: maxllm_gate adds intelligent scheduling, rate limiting, and multi-key management. It prevents 429 errors and maximizes throughput across multiple keys/providers.

Q: Does this work with OpenAI's official client?

A: maxllm_gate uses LiteLLM under the hood, which supports OpenAI and 100+ other providers. The API is similar but not identical to OpenAI's client.

Q: What happens when all keys are rate limited?

A: maxllm_gate automatically defers the request and waits for capacity to become available. No 429 errors!

Q: Can I use this in production?

A: Yes. Version 0.7.0 includes the SDK, the optional FastAPI gateway, Redis support, graceful shutdown, and automated tests for the current code paths.

Q: Which strategy should I use?

A: Use least_utilized if you want one setting that works everywhere. Use balanced for direct SDK usage when you want routing to account for utilization, latency, recent errors, and freshness. Use token_aware only in the gateway.

Q: Do I need Redis?

A: No, Redis is optional. It is recommended for production and distributed deployments, but maxllm_gate works with in-memory state for single-instance deployments.

License

MIT License - see LICENSE for details.

Acknowledgments

Built on top of LiteLLM for provider abstraction
Token estimation using tiktoken
Input validation with Pydantic

maxllm_gate v0.7.0 - Maximum LLM throughput with zero 429 errors.

Documentation • Issues • PyPI

Made for the AI community

Release history

This version

0.8.0

Mar 31, 2026

0.6.0

Mar 31, 2026

0.5.0

Mar 31, 2026

0.4.0

Mar 29, 2026

0.3.0

Mar 29, 2026

0.2.0

Mar 29, 2026

Download files

Source Distribution

maxllm_gate-0.8.0.tar.gz (75.5 kB view details)

Uploaded Mar 31, 2026 Source

Built Distribution

maxllm_gate-0.8.0-py3-none-any.whl (69.3 kB view details)

Uploaded Mar 31, 2026 Python 3

File details

Details for the file maxllm_gate-0.8.0.tar.gz.

File metadata

Download URL: maxllm_gate-0.8.0.tar.gz
Upload date: Mar 31, 2026
Size: 75.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for maxllm_gate-0.8.0.tar.gz
Algorithm	Hash digest
SHA256	`777783b788984eb679ce4b1a8a135f8ed5ece87db94cd1f1d8ae80f9cf022955`
MD5	`d85f1bfa4a38f57176fffe9e70ff3081`
BLAKE2b-256	`f7f819b28ea32281af7789e7338ce82242b013ca951cf45e41ba3457c3a41ede`

File details

Details for the file maxllm_gate-0.8.0-py3-none-any.whl.

File metadata

Download URL: maxllm_gate-0.8.0-py3-none-any.whl
Upload date: Mar 31, 2026
Size: 69.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for maxllm_gate-0.8.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1556c431e3589066eb017b079f5a7ced3b81a8bd19cadff5cfa6c5c164d8fbdd`
MD5	`21c15e5ccb6971ebe67dad639154fd3d`
BLAKE2b-256	`e5c8f02473c43523e8e324e3a115a06687865ba66269e2c7aca4df1e6580933f`
