tokentaxi
Adaptive rate-limit-aware LLM routing. Bring your own clients.
tokentaxi is a lightweight Python library that sits between your application and your LLM providers. It intelligently routes every request based on real-time provider health, rate-limit headroom, latency, and request priority — with zero network hops and no external dependencies beyond an optional Redis connection.
The core idea — Bring Your Own Client (BYOC)
Every other routing solution asks you to replace your LLM SDK. tokentaxi doesn't. You keep your existing, fully configured clients. The router wraps them and adds routing intelligence on top.
# What you'd write today — fragile, manual, scattered
try:
    response = openai_client.chat(...)
except RateLimitError:
    try:
        response = anthropic_client.chat(...)
    except RateLimitError:
        response = gemini_client.chat(...)
# What you write with tokentaxi — once, tested, intelligent
router = LLMRouter.from_dict({"providers": [...]})
response = await router.chat(RouterRequest(messages=messages))
Features
| Feature | Description |
|---|---|
| Adaptive rate-limit-aware routing | Tracks RPM and TPM in a rolling 60-second window. Routes to the provider with the most headroom. |
| Automatic fallback | Transparently retries with the next-ranked provider on any failure. |
| Circuit breaker | Trips per-provider circuits after N failures. Auto-recovers after cooldown. Redis-backed for multi-instance. |
| Latency-aware scoring (EMA) | Tracks latency per-provider using an exponential moving average. Slower providers get lower scores. |
| Quota exhaustion prediction | Proactively shifts load before a provider hits its hard limit. |
| Session affinity | Pass a session_id to pin all requests in a conversation to the same provider. |
| Priority lanes | Tag requests "high", "normal", or "low". High-priority traffic gets the best available provider. |
| Provider pinning | Override the router for a specific call via force_provider. |
| Static preference weights | Express a preference for one provider over others via a weight parameter (see the sketch below this table). |
| BYOC | Register your own pre-configured SDK clients. The router wraps them, not the other way around. |
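A minimal sketch of static preference weights, assuming the provider entry accepts a weight field (the scoring formula below reads provider.weight, but the exact key name and scale are not documented here):

router = LLMRouter.from_dict({
    "providers": [
        # Prefer OpenAI when capacity and latency are otherwise comparable
        {"name": "openai", "api_key": "sk-...", "model": "gpt-4o", "rpm_limit": 500, "tpm_limit": 200_000, "weight": 1.0},
        # Keep Groq as a low-preference overflow lane
        {"name": "groq", "api_key": "gsk-...", "model": "llama-3.1-70b", "rpm_limit": 30, "tpm_limit": 100_000, "weight": 0.3},
    ]
})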
Installation
# Core library — in-memory state, no extra deps
pip install tokentaxi
# Multi-instance deployments (Redis-backed state)
pip install "tokentaxi[redis]"
# Local real-time dashboard
pip install "tokentaxi[dashboard]"
# CLI (status command, watch mode)
pip install "tokentaxi[cli]"
# YAML config support
pip install "tokentaxi[yaml]"
# Everything
pip install "tokentaxi[all]"
Quick Start
From a dictionary
from tokentaxi import LLMRouter, RouterRequest
router = LLMRouter.from_dict({
    "providers": [
        {"name": "openai", "api_key": "sk-...", "model": "gpt-4o", "rpm_limit": 500, "tpm_limit": 200_000},
        {"name": "anthropic", "api_key": "sk-ant-...", "model": "claude-sonnet-4-5", "rpm_limit": 50, "tpm_limit": 200_000},
        {"name": "groq", "api_key": "gsk-...", "model": "llama-3.1-70b", "rpm_limit": 30, "tpm_limit": 100_000},
    ]
})

response = await router.chat(RouterRequest(
    messages=[{"role": "user", "content": "Summarize this article..."}],
    priority="normal",
))
print(response.content)
print(response.provider) # "anthropic"
print(response.latency_ms) # 310.4
print(response.attempts) # 1
BYOC — Bring Your Own Client
import openai
import anthropic
from tokentaxi import LLMRouter, RouterRequest
openai_client = openai.AsyncOpenAI(api_key="sk-...", timeout=30, max_retries=0)
anthropic_client = anthropic.AsyncAnthropic(api_key="sk-ant-...", timeout=30)
router = LLMRouter.from_dict({"providers": []})
router.register("openai", client=openai_client, model="gpt-4o", rpm=500, tpm=200_000)
router.register("anthropic", client=anthropic_client, model="claude-sonnet-4-5", rpm=50, tpm=200_000)
response = await router.chat(RouterRequest(messages=[{"role": "user", "content": "Hello"}]))
From a YAML file
router = LLMRouter.from_yaml("router.yaml")
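The YAML schema is not reproduced here; a plausible router.yaml, assuming the keys mirror the from_dict fields shown above, might look like:

providers:
  - name: openai
    api_key: "sk-..."
    model: gpt-4o
    rpm_limit: 500
    tpm_limit: 200000
  - name: anthropic
    api_key: "sk-ant-..."
    model: claude-sonnet-4-5
    rpm_limit: 50
    tpm_limit: 200000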
From environment variables
# Reads OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, GROQ_API_KEY
router = LLMRouter.from_env()
Priority Lanes
# User-facing — use best available provider
response = await router.chat(RouterRequest(messages=messages, priority="high"))
# Background batch job — don't burn premium quota
response = await router.chat(RouterRequest(messages=messages, priority="low"))
Session Affinity
# All requests with the same session_id go to the same provider
response = await router.chat(RouterRequest(
    messages=conversation_history,
    session_id="user-session-abc123",
))
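For example, a follow-up turn that reuses the same session_id should normally land on whichever provider served the first turn (fallback can still move it if that provider fails):

first = await router.chat(RouterRequest(
    messages=[{"role": "user", "content": "Draft a reply to this email..."}],
    session_id="user-session-abc123",
))
followup = await router.chat(RouterRequest(
    messages=[{"role": "user", "content": "Make it more formal."}],
    session_id="user-session-abc123",
))
print(first.provider, followup.provider)  # typically the same provider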
Provider Pinning
# Force a specific provider — fallback still applies if it fails
response = await router.chat(RouterRequest(
    messages=messages,
    force_provider="anthropic",
))
Streaming
async for chunk in router.stream(RouterRequest(messages=messages)):
    print(chunk, end="", flush=True)
Callbacks
async def on_route(event: RouteEvent):
print(f"Routed to {event.provider} | latency: {event.latency_ms}ms")
# Send to Datadog, Sentry, Slack, etc.
router = LLMRouter.from_yaml("router.yaml", on_route=on_route)
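A slightly fuller sketch that forwards route events to the standard logging module. Only event.provider and event.latency_ms appear in the example above, so the attempts field read here is an assumption (hence the getattr default):

import logging

logger = logging.getLogger("tokentaxi.routes")

async def on_route(event):
    # Escalate to WARNING when a request needed more than one attempt (i.e. fallback fired)
    level = logging.WARNING if getattr(event, "attempts", 1) > 1 else logging.INFO
    logger.log(level, "provider=%s latency_ms=%.1f", event.provider, event.latency_ms)

router = LLMRouter.from_yaml("router.yaml", on_route=on_route)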
Provider Status
In code
status = await router.status()
# {
# "openai": {"rpm_used": 423, "rpm_limit": 500, "headroom_pct": 15.4, "circuit_open": False, "avg_latency_ms": 312},
# "anthropic": {"rpm_used": 12, "rpm_limit": 50, "headroom_pct": 76.0, "circuit_open": False, "avg_latency_ms": 410},
# }
FastAPI integration
@app.get("/llm/status")
async def llm_status():
    return await router.status()
CLI
tokentaxi status --config router.yaml
tokentaxi status --watch --interval 3 # live-updating like htop
Dashboard
pip install "tokentaxi[dashboard]"
tokentaxi dashboard --config router.yaml
# → open http://localhost:8501
Scoring Formula
score = (capacity_score × w_capacity) + (latency_score × w_latency) + (static_score × w_static)
capacity_score = min(rpm_headroom, tpm_headroom)
rpm_headroom = 1 - (rpm_used / rpm_limit)
tpm_headroom = 1 - ((tpm_used + estimated_tokens) / tpm_limit)
latency_score = max(0, 1 - (latency_ema_ms / 3000))
static_score = provider.weight
# Default weights (normal priority)
w_capacity = 0.5 | w_latency = 0.3 | w_static = 0.2
# High priority
w_capacity = 0.5 | w_latency = 0.4 | w_static = 0.1
# Low priority
w_capacity = 0.3 | w_latency = 0.1 | w_static = 0.6
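As a worked example with made-up numbers and the normal-priority weights: a provider at 423/500 RPM, 150k of its 200k TPM used with an estimated 2k tokens for this request, a 310 ms latency EMA, and a static weight of 1.0 scores roughly 0.55:

rpm_headroom   = 1 - 423 / 500                      # 0.154
tpm_headroom   = 1 - (150_000 + 2_000) / 200_000    # 0.24
capacity_score = min(rpm_headroom, tpm_headroom)    # 0.154
latency_score  = max(0, 1 - 310 / 3000)             # ~0.897
static_score   = 1.0                                # provider weight
score = 0.5 * capacity_score + 0.3 * latency_score + 0.2 * static_score
# ≈ 0.077 + 0.269 + 0.200 ≈ 0.546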
Multi-Instance Deployments
pip install "tokentaxi[redis]"
# router.yaml
redis_url: "redis://localhost:6379"
With Redis, every router instance shares the same view of provider state: sliding-window usage, circuit-breaker status, and session affinity. Scale horizontally without extra coordination.
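A minimal sketch, assuming from_dict accepts the same top-level redis_url key as the YAML file; run the identical config on every instance and shared state lives in Redis rather than in-process:

router = LLMRouter.from_dict({
    "redis_url": "redis://localhost:6379",
    "providers": [
        {"name": "openai", "api_key": "sk-...", "model": "gpt-4o", "rpm_limit": 500, "tpm_limit": 200_000},
    ],
})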
Project Structure
tokentaxi/
├── __init__.py # Public API exports
├── router.py # LLMRouter — main class
├── config.py # RouterConfig, RoutingWeights, CircuitBreakerConfig
├── models.py # RouterRequest, RouterResponse, ProviderConfig, RouteEvent
├── exceptions.py # AllProvidersFailed, NoProvidersConfigured, TokenLimitExceeded
├── constants.py # Default weights, thresholds, window sizes
├── cli.py # typer CLI (status, dashboard commands)
├── _dashboard.py # Streamlit dashboard
├── engine/
│ ├── scorer.py # Provider scoring (capacity + latency + static weight)
│ ├── estimator.py # Pre-flight token count estimation (tiktoken)
│ └── predictor.py # Quota exhaustion prediction
├── providers/
│ ├── base.py # BaseProvider abstract class
│ ├── registry.py # ProviderRegistry (thread-safe)
│ ├── openai.py # OpenAI adapter
│ ├── anthropic.py # Anthropic adapter
│ ├── gemini.py # Gemini adapter
│ └── groq.py # Groq adapter
├── state/
│ ├── base.py # AbstractStateBackend interface
│ ├── memory.py # InMemoryStateBackend (default, zero deps)
│ └── redis.py # RedisStateBackend (multi-instance)
└── breaker/
└── circuit.py # CircuitBreaker (per-provider)
tests/
├── conftest.py # Shared fixtures
├── test_scorer.py # Scorer unit tests
├── test_circuit_breaker.py
├── test_state_memory.py
├── test_predictor.py
└── test_router.py # Integration tests (mocked providers)
examples/
├── quickstart.py # Dict config quickstart
├── byoc.py # BYOC example
├── streaming.py # Streaming example
└── router.yaml # YAML config example
Running Tests
pip install "tokentaxi[dev]"
pytest
Publishing
pip install hatch twine
hatch build
twine upload dist/*
Licence
MIT