
Project description

tokentaxi

Adaptive rate-limit-aware LLM routing. Bring your own clients.

tokentaxi is a lightweight Python library that sits between your application and your LLM providers. It routes every request based on real-time provider health, rate-limit headroom, latency, and request priority. Routing happens in-process, so it adds no extra network hop, and there are no external dependencies beyond an optional Redis connection.


The core idea — Bring Your Own Client (BYOC)

Every other routing solution asks you to replace your LLM SDK. tokentaxi doesn't. You keep your existing, fully configured clients. The router wraps them and adds routing intelligence on top.

# What you'd write today — fragile, manual, scattered
try:
    response = openai_client.chat(...)
except RateLimitError:
    try:
        response = anthropic_client.chat(...)
    except RateLimitError:
        response = gemini_client.chat(...)

# What you write with tokentaxi — once, tested, intelligent
router = LLMRouter.from_dict({"providers": [...]})
response = await router.chat(RouterRequest(messages=messages))

Features

Adaptive rate-limit-aware routing: tracks RPM and TPM (requests and tokens per minute) in a rolling 60-second window and routes to the provider with the most headroom.
Automatic fallback: transparently retries with the next-ranked provider on any failure.
Circuit breaker: trips a per-provider circuit after N failures and auto-recovers after a cooldown. Redis-backed for multi-instance deployments.
Latency-aware scoring (EMA): tracks per-provider latency with an exponential moving average; slower providers score lower.
Quota exhaustion prediction: proactively shifts load before a provider hits its hard limit.
Session affinity: pass a session_id to pin all requests in a conversation to the same provider.
Priority lanes: tag requests "high", "normal", or "low"; high-priority traffic gets the best available provider.
Provider pinning: override the router for a specific call via force_provider.
Static preference weights: express a preference for one provider over others via a weight parameter (see the sketch after this list).
BYOC: register your own pre-configured SDK clients. The router wraps them, not the other way around.
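
For the static preference weights above, here is a minimal sketch of what this might look like in provider config. The weight field name is an assumption, inferred from the scoring formula further down (static_score = provider.weight):

router = LLMRouter.from_dict({
    "providers": [
        # "weight" is assumed here; a higher value biases routing toward that provider
        {"name": "openai", "api_key": "sk-...",  "model": "gpt-4o",        "rpm_limit": 500, "tpm_limit": 200_000, "weight": 0.8},
        {"name": "groq",   "api_key": "gsk-...", "model": "llama-3.1-70b", "rpm_limit": 30,  "tpm_limit": 100_000, "weight": 0.2},
    ]
})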

Installation

# Core library — in-memory state, no extra deps
pip install tokentaxi

# Multi-instance deployments (Redis-backed state)
pip install "tokentaxi[redis]"

# Local real-time dashboard
pip install "tokentaxi[dashboard]"

# CLI (status command, watch mode)
pip install "tokentaxi[cli]"

# YAML config support
pip install "tokentaxi[yaml]"

# Everything
pip install "tokentaxi[all]"

Quick Start

From a dictionary

from tokentaxi import LLMRouter, RouterRequest

router = LLMRouter.from_dict({
    "providers": [
        {"name": "openai",    "api_key": "sk-...",    "model": "gpt-4o",            "rpm_limit": 500, "tpm_limit": 200_000},
        {"name": "anthropic", "api_key": "sk-ant-...", "model": "claude-sonnet-4-5", "rpm_limit": 50,  "tpm_limit": 200_000},
        {"name": "groq",      "api_key": "gsk-...",   "model": "llama-3.1-70b",     "rpm_limit": 30,  "tpm_limit": 100_000},
    ]
})

response = await router.chat(RouterRequest(
    messages=[{"role": "user", "content": "Summarize this article..."}],
    priority="normal",
))

print(response.content)
print(response.provider)    # "anthropic"
print(response.latency_ms)  # 310.4
print(response.attempts)    # 1
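
If every provider fails or is rate-limited, the exceptions module suggests the router raises AllProvidersFailed. A minimal sketch of handling it, assuming the class is exported from the package root (see exceptions.py in the project structure):

from tokentaxi import AllProvidersFailed  # assumed public export; defined in exceptions.py

try:
    response = await router.chat(RouterRequest(messages=messages, priority="high"))
except AllProvidersFailed:
    # Every provider was exhausted or unhealthy; degrade gracefully instead of crashing
    response = None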

BYOC — Bring Your Own Client

import openai
import anthropic
from tokentaxi import LLMRouter, RouterRequest

openai_client    = openai.AsyncOpenAI(api_key="sk-...", timeout=30, max_retries=0)
anthropic_client = anthropic.AsyncAnthropic(api_key="sk-ant-...", timeout=30)

router = LLMRouter.from_dict({"providers": []})
router.register("openai",    client=openai_client,    model="gpt-4o",            rpm=500, tpm=200_000)
router.register("anthropic", client=anthropic_client, model="claude-sonnet-4-5", rpm=50,  tpm=200_000)

response = await router.chat(RouterRequest(messages=[{"role": "user", "content": "Hello"}]))

From a YAML file

router = LLMRouter.from_yaml("router.yaml")
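
The file itself is not shown in this README; a minimal sketch of what router.yaml might contain, assuming its keys mirror the from_dict config above:

# router.yaml: keys assumed to mirror the from_dict example above
providers:
  - name: openai
    api_key: sk-...
    model: gpt-4o
    rpm_limit: 500
    tpm_limit: 200000
  - name: anthropic
    api_key: sk-ant-...
    model: claude-sonnet-4-5
    rpm_limit: 50
    tpm_limit: 200000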

From environment variables

# Reads OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, GROQ_API_KEY
router = LLMRouter.from_env()

Priority Lanes

# User-facing — use best available provider
response = await router.chat(RouterRequest(messages=messages, priority="high"))

# Background batch job — don't burn premium quota
response = await router.chat(RouterRequest(messages=messages, priority="low"))

Session Affinity

# All requests with the same session_id go to the same provider
response = await router.chat(RouterRequest(
    messages=conversation_history,
    session_id="user-session-abc123",
))

Provider Pinning

# Force a specific provider — fallback still applies if it fails
response = await router.chat(RouterRequest(
    messages=messages,
    force_provider="anthropic",
))

Streaming

async for chunk in router.stream(RouterRequest(messages=messages)):
    print(chunk, end="", flush=True)

Callbacks

from tokentaxi import RouteEvent  # RouteEvent lives in models.py; assumed to be a public export

async def on_route(event: RouteEvent):
    print(f"Routed to {event.provider} | latency: {event.latency_ms}ms")
    # Send to Datadog, Sentry, Slack, etc.

router = LLMRouter.from_yaml("router.yaml", on_route=on_route)

Provider Status

In code

status = await router.status()
# {
#   "openai":    {"rpm_used": 423, "rpm_limit": 500, "headroom_pct": 15.4, "circuit_open": False, "avg_latency_ms": 312},
#   "anthropic": {"rpm_used": 12,  "rpm_limit": 50,  "headroom_pct": 76.0, "circuit_open": False, "avg_latency_ms": 410},
# }

FastAPI integration

@app.get("/llm/status")
async def llm_status():
    return await router.status()

CLI

tokentaxi status --config router.yaml
tokentaxi status --watch --interval 3    # live-updating like htop

Dashboard

pip install "tokentaxi[dashboard]"
tokentaxi dashboard --config router.yaml
# → open http://localhost:8501

Scoring Formula

score = (capacity_score × w_capacity) + (latency_score × w_latency) + (static_score × w_static)

capacity_score = min(rpm_headroom, tpm_headroom)
rpm_headroom   = 1 - (rpm_used / rpm_limit)
tpm_headroom   = 1 - ((tpm_used + estimated_tokens) / tpm_limit)
latency_score  = max(0, 1 - (latency_ema_ms / 3000))
static_score   = provider.weight

# Default weights (normal priority)
w_capacity = 0.5  |  w_latency = 0.3  |  w_static = 0.2

# High priority
w_capacity = 0.5  |  w_latency = 0.4  |  w_static = 0.1

# Low priority
w_capacity = 0.3  |  w_latency = 0.1  |  w_static = 0.6
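
As a worked example of the formula, here is a minimal sketch of scoring one provider under normal priority. Function and variable names are illustrative, not the library's internal API, and the TPM figures in the example call are invented for demonstration:

def provider_score(rpm_used, rpm_limit, tpm_used, tpm_limit,
                   estimated_tokens, latency_ema_ms, weight,
                   w_capacity=0.5, w_latency=0.3, w_static=0.2):
    # Headroom left in the rolling 60-second window, as a fraction of each limit
    rpm_headroom = 1 - (rpm_used / rpm_limit)
    tpm_headroom = 1 - ((tpm_used + estimated_tokens) / tpm_limit)
    capacity_score = min(rpm_headroom, tpm_headroom)

    # Latency score decays linearly; an EMA of 3000 ms or more scores 0
    latency_score = max(0, 1 - (latency_ema_ms / 3000))

    # Static per-provider preference
    static_score = weight

    return (capacity_score * w_capacity
            + latency_score * w_latency
            + static_score * w_static)

# e.g. 423/500 RPM used, 150k/200k TPM used, 1k tokens estimated, 312 ms EMA, weight 0.5
print(round(provider_score(423, 500, 150_000, 200_000, 1_000, 312, 0.5), 2))  # ~0.45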

Multi-Instance Deployments

pip install "tokentaxi[redis]"
# router.yaml
redis_url: "redis://localhost:6379"

With Redis, all router instances share the same accurate picture of provider state — sliding window usage, circuit breaker status, and session affinity. Scale horizontally without coordination.
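
If you configure from a dictionary instead of YAML, a top-level redis_url entry would presumably do the same thing (key name assumed to mirror the YAML key above):

router = LLMRouter.from_dict({
    "redis_url": "redis://localhost:6379",  # assumed to mirror the router.yaml key
    "providers": [...],
})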


Project Structure

tokentaxi/
├── __init__.py          # Public API exports
├── router.py            # LLMRouter — main class
├── config.py            # RouterConfig, RoutingWeights, CircuitBreakerConfig
├── models.py            # RouterRequest, RouterResponse, ProviderConfig, RouteEvent
├── exceptions.py        # AllProvidersFailed, NoProvidersConfigured, TokenLimitExceeded
├── constants.py         # Default weights, thresholds, window sizes
├── cli.py               # typer CLI (status, dashboard commands)
├── _dashboard.py        # Streamlit dashboard
├── engine/
│   ├── scorer.py        # Provider scoring (capacity + latency + static weight)
│   ├── estimator.py     # Pre-flight token count estimation (tiktoken)
│   └── predictor.py     # Quota exhaustion prediction
├── providers/
│   ├── base.py          # BaseProvider abstract class
│   ├── registry.py      # ProviderRegistry (thread-safe)
│   ├── openai.py        # OpenAI adapter
│   ├── anthropic.py     # Anthropic adapter
│   ├── gemini.py        # Gemini adapter
│   └── groq.py          # Groq adapter
├── state/
│   ├── base.py          # AbstractStateBackend interface
│   ├── memory.py        # InMemoryStateBackend (default, zero deps)
│   └── redis.py         # RedisStateBackend (multi-instance)
└── breaker/
    └── circuit.py       # CircuitBreaker (per-provider)

tests/
├── conftest.py          # Shared fixtures
├── test_scorer.py       # Scorer unit tests
├── test_circuit_breaker.py
├── test_state_memory.py
├── test_predictor.py
└── test_router.py       # Integration tests (mocked providers)

examples/
├── quickstart.py        # Dict config quickstart
├── byoc.py              # BYOC example
├── streaming.py         # Streaming example
└── router.yaml          # YAML config example

Running Tests

pip install "tokentaxi[dev]"
pytest

Publishing

pip install hatch twine
hatch build
twine upload dist/*

Licence

MIT



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokentaxi-1.1.4.tar.gz (40.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tokentaxi-1.1.4-py3-none-any.whl (43.5 kB)

Uploaded Python 3

File details

Details for the file tokentaxi-1.1.4.tar.gz.

File metadata

  • Download URL: tokentaxi-1.1.4.tar.gz
  • Upload date:
  • Size: 40.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for tokentaxi-1.1.4.tar.gz
Algorithm Hash digest
SHA256 4860f05c02778d1d33afeda3502aa1c53c17d17c832cc5c88c2455ba929f6a97
MD5 a38b2758cbf53fb51ef28508cdec2ae5
BLAKE2b-256 ef2f6c1daf08680fc3606ae6ca59d382256119bda8640e39b3de70a89725261d


File details

Details for the file tokentaxi-1.1.4-py3-none-any.whl.

File metadata

  • Download URL: tokentaxi-1.1.4-py3-none-any.whl
  • Upload date:
  • Size: 43.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for tokentaxi-1.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 c4e646234dbc05c25e212b3abf8c33fa6135a29edeb8e034b7c01bdd66e04d4c
MD5 f67ad52ac4ad65f091d5c8d2b717acbf
BLAKE2b-256 4caa166dd56c5ee2cb6c860c5920f2f1f0c7435bf2f1429e3c9c3f8b0c5229cc

