Production-grade LLM routing with automatic fallbacks, canary deployments, cost tracking, semantic routing, caching, and CloudWatch integration

Agentic AI Gateway

Production-grade LLM routing with automatic fallbacks, canary deployments, and multi-provider support.

Created by Tyler Canton | PyPI | Documentation

The Problem

When you call an LLM directly, you're one API error away from a crashed application:

# If Claude is down, rate-limited, or throws an error... your app crashes
response = bedrock.invoke_model(modelId='anthropic.claude-3-sonnet...')

A direct SDK call to AWS Bedrock, OpenAI, or another LLM provider does not, by itself, give you:

Automatic fallback to alternative models
Traffic splitting for A/B testing new models
Centralized monitoring across models
Runtime configuration without redeployment

The Solution

Agentic AI Gateway sits between your application and LLM providers:

┌─────────────┐     ┌────────────────────┐     ┌──────────────────────────────┐
│  Your App   │────▶│ Agentic AI Gateway │────▶│ Claude (Primary)             │
└─────────────┘     │                    │     │ Llama (Fallback)             │
                    │ - Routing          │     │ GPT-4 (Cross-provider backup)│
                    │ - Fallback         │     └──────────────────────────────┘
                    │ - Canary           │
                    │ - Metrics          │
                    └────────────────────┘
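The routing shown in the diagram can be pictured as a loop over an ordered list of models. The following is a minimal sketch of that idea, not the library's implementation: `invoke_with_fallback`, `GatewayResult`, and the `call_model` callable are hypothetical names introduced only for illustration, and the provider is a stub.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class GatewayResult:
    content: str
    model_used: str
    fallback_used: bool


def invoke_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> GatewayResult:
    """Try each model in order and return the first successful response."""
    if not models:
        raise ValueError("at least one model is required")
    errors = []
    for index, model_id in enumerate(models):
        try:
            content = call_model(model_id, prompt)
        except Exception as exc:  # any provider error moves on to the next model
            errors.append(f"{model_id}: {exc}")
            continue
        return GatewayResult(content, model_id, fallback_used=index > 0)
    raise RuntimeError("all models failed: " + "; ".join(errors))


# Stub provider (hypothetical): the primary always fails, the backup answers.
def fake_call(model_id: str, prompt: str) -> str:
    if model_id == "primary-model":
        raise TimeoutError("simulated outage")
    return f"[{model_id}] echo: {prompt}"


result = invoke_with_fallback("ping", ["primary-model", "backup-model"], fake_call)
print(result.model_used, result.fallback_used)
```

The result carries the same three pieces of information the Quick Start examples read from a response: the content, the model that answered, and whether a fallback was needed.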

Installation

# For AWS Bedrock (the quotes stop shells such as zsh from treating the brackets as a glob pattern)
pip install "agentic-ai-gateway[bedrock]"

# For OpenAI
pip install "agentic-ai-gateway[openai]"

# For both (cross-provider fallback)
pip install "agentic-ai-gateway[all]"

Quick Start

AWS Bedrock

from agentic_ai_gateway import create_bedrock_gateway

# Create gateway with automatic fallback
gateway = create_bedrock_gateway(
    primary_model="anthropic.claude-3-sonnet-20240229-v1:0",
    fallback_models=["anthropic.claude-3-haiku-20240307-v1:0"],
    region="us-east-1"
)

# Use it - if Claude Sonnet fails, automatically tries Haiku
response = gateway.invoke("What is the capital of France?")
print(response.content)  # "The capital of France is Paris."
print(response.model_used)  # Shows which model actually responded
print(response.fallback_used)  # True if primary failed

OpenAI

from agentic_ai_gateway import create_openai_gateway

gateway = create_openai_gateway(
    primary_model="gpt-4o",
    fallback_models=["gpt-4o-mini"],
    api_key="sk-..."
)

response = gateway.invoke("Explain quantum computing")

Cross-Provider Fallback

from agentic_ai_gateway import create_multi_provider_gateway

# Ultimate resilience: fall back across providers
gateway = create_multi_provider_gateway(
    primary_model="anthropic.claude-3-sonnet-20240229-v1:0",
    fallback_models=[
        "anthropic.claude-3-haiku-20240307-v1:0",  # Bedrock fallback
        "gpt-4o-mini",  # OpenAI fallback
    ],
    bedrock_region="us-east-1",
    openai_api_key="sk-..."
)

response = gateway.invoke("Summarize this document...")
# Tries Claude Sonnet → Claude Haiku → GPT-4o Mini

Canary Deployments

Test new models on a percentage of traffic:

from agentic_ai_gateway import create_bedrock_gateway

gateway = create_bedrock_gateway(
    primary_model="anthropic.claude-3-sonnet-20240229-v1:0",
    canary_model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    canary_percentage=10,  # 10% traffic to Claude 3.5
    fallback_models=["anthropic.claude-3-haiku-20240307-v1:0"]
)

# 90% of requests go to Claude 3 Sonnet
# 10% of requests go to Claude 3.5 Sonnet (canary)
response = gateway.invoke("Hello!")
print(response.canary_used)  # True if canary was selected
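To make the 90/10 split concrete, here is a small sketch of percentage-based traffic splitting. It assumes a uniform random draw per request; the package may select the canary differently. `pick_model` and the model names are hypothetical.

```python
import random
from typing import Optional


def pick_model(
    primary_model: str,
    canary_model: Optional[str],
    canary_percentage: float,
    rng: random.Random,
) -> tuple[str, bool]:
    """Return (model_id, canary_used) for one request."""
    if canary_model is None or canary_percentage <= 0:
        return primary_model, False
    # rng.random() is uniform on [0, 1), so this branch is taken for
    # roughly canary_percentage percent of requests.
    if rng.random() * 100 < canary_percentage:
        return canary_model, True
    return primary_model, False


rng = random.Random(42)  # seeded so the demo is repeatable
picks = [pick_model("stable-model", "new-model", 10, rng) for _ in range(10_000)]
canary_share = sum(1 for _, used in picks if used) / len(picks)
print(f"canary share: {canary_share:.3f}")
```

Because each request is drawn independently, the observed share only approaches the configured percentage over many requests; small samples can deviate noticeably.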

Gradual Rollout

# Week 1: 5% canary
gateway.update_config(canary_percentage=5)

# Week 2: 20% canary (metrics look good)
gateway.update_config(canary_percentage=20)

# Week 3: 50% canary
gateway.update_config(canary_percentage=50)

# Week 4: Promote canary to primary
gateway.update_config(
    primary_model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    canary_model=None,
    canary_percentage=0
)
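A rollout like this is usually gated on metrics rather than on the calendar alone. The sketch below shows one possible gate, under stated assumptions: the step schedule, the `tolerance` threshold, and the function name `next_canary_percentage` are all hypothetical and not part of the package.

```python
from typing import Sequence

# Assumed schedule, mirroring the weekly steps above; 100 means "promote".
ROLLOUT_STEPS: Sequence[int] = (5, 20, 50, 100)


def next_canary_percentage(
    current: int,
    canary_error_rate: float,
    primary_error_rate: float,
    tolerance: float = 0.01,
) -> int:
    """Advance one step if the canary is healthy, otherwise roll back to 0."""
    if canary_error_rate > primary_error_rate + tolerance:
        return 0  # canary is clearly worse than the primary: stop sending traffic
    for step in ROLLOUT_STEPS:
        if step > current:
            return step
    return current  # already at the final step


print(next_canary_percentage(5, canary_error_rate=0.010, primary_error_rate=0.012))
print(next_canary_percentage(20, canary_error_rate=0.080, primary_error_rate=0.012))
```

The returned value could then be passed to `gateway.update_config(canary_percentage=...)` as in the steps above.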

Monitoring

Built-in metrics tracking:

# After running some requests
metrics = gateway.get_metrics()

print(metrics)
# {
#     "total_invocations": 1000,
#     "total_errors": 12,
#     "error_rate": 0.012,
#     "fallback_rate": 0.03,
#     "avg_latency_ms": 1250,
#     "by_model": {
#         "anthropic.claude-3-sonnet...": {"invocations": 900, "errors": 10},
#         "anthropic.claude-3-haiku...": {"invocations": 100, "errors": 2}
#     }
# }
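The aggregate `error_rate` can hide a single unhealthy model, so it is often worth deriving per-model rates from `by_model`. This sketch assumes the dictionary shape printed above; `per_model_error_rates` is a hypothetical helper and the model names are placeholders.

```python
def per_model_error_rates(metrics: dict) -> dict[str, float]:
    """Compute errors / invocations for every entry under "by_model"."""
    rates = {}
    for model_id, stats in metrics.get("by_model", {}).items():
        invocations = stats.get("invocations", 0)
        errors = stats.get("errors", 0)
        rates[model_id] = errors / invocations if invocations else 0.0
    return rates


# Sample shaped like the output above; model names shortened to placeholders.
sample = {
    "total_invocations": 1000,
    "total_errors": 12,
    "by_model": {
        "model-a": {"invocations": 900, "errors": 10},
        "model-b": {"invocations": 100, "errors": 2},
    },
}
rates = per_model_error_rates(sample)
worst = max(rates, key=rates.get)
print(worst, round(rates[worst], 4))
```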

CloudWatch Integration

import boto3
from agentic_ai_gateway import AgenticGateway, AgenticGatewayConfig, BedrockProvider

class CloudWatchMetrics:
    def __init__(self, namespace="AgenticGateway"):
        self.cloudwatch = boto3.client("cloudwatch")
        self.namespace = namespace

    def record(self, model_id, latency_ms, success, is_canary, is_fallback, error=None):
        self.cloudwatch.put_metric_data(
            Namespace=self.namespace,
            MetricData=[
                {
                    "MetricName": "Invocations",
                    "Value": 1,
                    "Dimensions": [
                        {"Name": "ModelId", "Value": model_id},
                        {"Name": "Success", "Value": str(success)}
                    ]
                },
                {
                    "MetricName": "Latency",
                    "Value": latency_ms,
                    "Unit": "Milliseconds",
                    "Dimensions": [{"Name": "ModelId", "Value": model_id}]
                }
            ]
        )

# Use custom metrics
gateway = AgenticGateway(
    config=AgenticGatewayConfig(
        primary_model="anthropic.claude-3-sonnet-20240229-v1:0",
        fallback_models=["anthropic.claude-3-haiku-20240307-v1:0"]
    ),
    providers=[BedrockProvider()],
    metrics=CloudWatchMetrics()
)
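The example above suggests that the gateway accepts any object exposing a `record(...)` method with that signature. Assuming that holds, an in-memory recorder with the same signature can be handy in unit tests, where calling CloudWatch is undesirable. `InMemoryMetrics` and its `summary` method are hypothetical.

```python
from collections import defaultdict


class InMemoryMetrics:
    """Collects the fields passed to record() in process memory."""

    def __init__(self):
        self.latencies_ms = defaultdict(list)
        self.errors = defaultdict(int)

    # Same parameters as CloudWatchMetrics.record in the example above.
    def record(self, model_id, latency_ms, success, is_canary, is_fallback, error=None):
        self.latencies_ms[model_id].append(latency_ms)
        if not success:
            self.errors[model_id] += 1

    def summary(self, model_id):
        samples = self.latencies_ms[model_id]
        return {
            "invocations": len(samples),
            "errors": self.errors[model_id],
            "avg_latency_ms": sum(samples) / len(samples) if samples else 0.0,
        }


metrics = InMemoryMetrics()
metrics.record("model-a", 1200, True, is_canary=False, is_fallback=False)
metrics.record("model-a", 800, False, is_canary=False, is_fallback=False, error="throttled")
print(metrics.summary("model-a"))
```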

Custom Providers

Add support for any LLM provider:

from agentic_ai_gateway import AgenticGateway, AgenticGatewayConfig, LLMProvider

class AnthropicDirectProvider(LLMProvider):
    def __init__(self, api_key: str):
        import anthropic
        self.client = anthropic.Anthropic(api_key=api_key)

    def supports_model(self, model_id: str) -> bool:
        return "claude" in model_id and "anthropic." not in model_id

    def invoke(self, model_id: str, prompt: str, **kwargs):
        response = self.client.messages.create(
            model=model_id,
            max_tokens=kwargs.get("max_tokens", 1024),
            messages=[{"role": "user", "content": prompt}]
        )
        content = response.content[0].text
        return content, response.usage.input_tokens, response.usage.output_tokens

# Use it
gateway = AgenticGateway(
    config=AgenticGatewayConfig(primary_model="claude-3-opus-20240229"),
    providers=[AnthropicDirectProvider(api_key="sk-...")]
)
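The `supports_model` method implies that the gateway asks each registered provider whether it can serve a given model ID. A plausible dispatch rule, sketched here as an assumption rather than the package's actual logic, is "first provider that claims the model wins". `Provider`, `EchoProvider`, and `find_provider` are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Stand-in for the two-method provider interface shown above."""

    @abstractmethod
    def supports_model(self, model_id: str) -> bool: ...

    @abstractmethod
    def invoke(self, model_id: str, prompt: str, **kwargs): ...


class EchoProvider(Provider):
    def __init__(self, prefix: str):
        self.prefix = prefix

    def supports_model(self, model_id: str) -> bool:
        return model_id.startswith(self.prefix)

    def invoke(self, model_id: str, prompt: str, **kwargs):
        # Same return shape as the example above: (content, input_tokens, output_tokens)
        return f"{model_id}: {prompt}", len(prompt.split()), 2


def find_provider(providers, model_id: str) -> Provider:
    """Return the first provider that claims the model, in registration order."""
    for provider in providers:
        if provider.supports_model(model_id):
            return provider
    raise LookupError(f"no provider supports {model_id!r}")


providers = [EchoProvider("anthropic."), EchoProvider("gpt-")]
chosen = find_provider(providers, "gpt-4o-mini")
content, input_tokens, output_tokens = chosen.invoke("gpt-4o-mini", "hello there")
print(content, input_tokens, output_tokens)
```

Under a first-match rule, registration order matters whenever two providers could claim the same ID, which is why the `AnthropicDirectProvider` above explicitly excludes IDs containing `anthropic.`.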

Multi-Agent Tool Calling

For multi-agent workflows that need tool calling, use the converse() method:

from agentic_ai_gateway import create_bedrock_gateway

gateway = create_bedrock_gateway(
    primary_model="anthropic.claude-3-sonnet-20240229-v1:0",
    fallback_models=["anthropic.claude-3-haiku-20240307-v1:0"]
)

# Define tools
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "get_patient_data",
                "description": "Retrieve patient records",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "patient_id": {"type": "string"}
                        },
                        "required": ["patient_id"]
                    }
                }
            }
        }
    ]
}

# Use converse() with tool calling - includes automatic fallback
result = gateway.converse(
    messages=[{
        "role": "user",
        "content": [{"text": "Look up patient P001"}]
    }],
    system=[{"text": "You are a healthcare assistant."}],
    tool_config=tool_config,
    inference_config={"maxTokens": 4096, "temperature": 0.1}
)

print(f"Model used: {result['model_used']}")
print(f"Fallback used: {result['fallback_used']}")

# Access raw Bedrock response
response = result["response"]
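Once the raw response is available, an agent loop typically checks whether the model asked for a tool and then sends the tool's output back. The sketch below assumes `result["response"]` follows the Amazon Bedrock Converse response shape (`stopReason`, `output.message.content`, `toolUse` blocks); the helper names and the sample data are hypothetical.

```python
def extract_tool_calls(response: dict) -> list[dict]:
    """Return the toolUse blocks from a Converse-style response, if any."""
    if response.get("stopReason") != "tool_use":
        return []
    blocks = response.get("output", {}).get("message", {}).get("content", [])
    return [block["toolUse"] for block in blocks if "toolUse" in block]


def tool_result_message(tool_use_id: str, payload: dict) -> dict:
    """Build the user-role message that reports a tool result back to the model."""
    return {
        "role": "user",
        "content": [
            {"toolResult": {"toolUseId": tool_use_id, "content": [{"json": payload}]}}
        ],
    }


# Hand-written sample in the Converse response shape (not real model output).
sample_response = {
    "stopReason": "tool_use",
    "output": {"message": {"role": "assistant", "content": [
        {"text": "Looking that up."},
        {"toolUse": {"toolUseId": "t-1", "name": "get_patient_data",
                     "input": {"patient_id": "P001"}}},
    ]}},
}

calls = extract_tool_calls(sample_response)
follow_up = tool_result_message(calls[0]["toolUseId"], {"status": "found"})
print(calls[0]["name"], calls[0]["input"])
```

In a full loop, the assistant message and `follow_up` would be appended to `messages` and passed to `gateway.converse()` again until the model stops requesting tools.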

RAG Pipeline Integration

Integrate with your RAG pipeline for resilient document Q&A:

from agentic_ai_gateway import create_bedrock_gateway

gateway = create_bedrock_gateway(
    primary_model="anthropic.claude-3-sonnet-20240229-v1:0",
    fallback_models=["anthropic.claude-3-haiku-20240307-v1:0"],
    canary_model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    canary_percentage=10  # A/B test new model
)

def rag_query(question: str, context_chunks: list[str]) -> dict:
    """RAG query with automatic fallback."""
    # Join the chunks up front: a backslash inside an f-string expression
    # is a syntax error before Python 3.12.
    context = "\n".join(context_chunks)
    prompt = f"""Answer based on context:

Context:
{context}

Question: {question}"""

    response = gateway.invoke(prompt, max_tokens=500, temperature=0.3)

    return {
        "answer": response.content,
        "model_used": response.model_used,
        "fallback_used": response.fallback_used
    }

Async Support

import asyncio
from agentic_ai_gateway import create_bedrock_gateway

gateway = create_bedrock_gateway()

async def main():
    response = await gateway.ainvoke("Hello async world!")
    print(response.content)

asyncio.run(main())
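Since `ainvoke` is awaitable, several prompts can be issued concurrently instead of one after another. The following sketch uses a stub in place of the real gateway so that it runs without credentials; `StubGateway`, `StubResponse`, and `invoke_many` are hypothetical, and the concurrency cap is an illustrative choice.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResponse:
    content: str


class StubGateway:
    """Hypothetical stand-in exposing only an ainvoke() coroutine."""

    async def ainvoke(self, prompt: str) -> StubResponse:
        await asyncio.sleep(0.01)  # simulate network latency
        return StubResponse(content=prompt.upper())


async def invoke_many(gateway, prompts, max_concurrency: int = 4):
    """Run prompts concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str):
        async with semaphore:
            return await gateway.ainvoke(prompt)

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


responses = asyncio.run(invoke_many(StubGateway(), ["alpha", "beta", "gamma"]))
print([r.content for r in responses])
```

Capping concurrency with a semaphore is one way to stay under provider rate limits while still overlapping requests.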

Streaming Support (v0.2.0+)

Stream tokens in real time for chat interfaces and SSE endpoints:

Basic Streaming

from agentic_ai_gateway import create_bedrock_gateway

gateway = create_bedrock_gateway(
    primary_model="anthropic.claude-3-sonnet-20240229-v1:0",
    fallback_models=["anthropic.claude-3-haiku-20240307-v1:0"]
)

# Synchronous streaming
for chunk in gateway.invoke_stream("Tell me a story"):
    if chunk["type"] == "start":
        print(f"Using model: {chunk['model_used']}")
    elif chunk["type"] == "token":
        print(chunk["content"], end="", flush=True)
    elif chunk["type"] == "done":
        print(f"\n\nCompleted in {chunk['latency_ms']}ms")
        print(f"Tokens: {chunk['output_tokens']}")

Async Streaming (for FastAPI/aiohttp)

import asyncio
from agentic_ai_gateway import create_bedrock_gateway

gateway = create_bedrock_gateway()

async def stream_response():
    async for chunk in gateway.ainvoke_stream("Explain quantum computing"):
        if chunk["type"] == "token":
            yield chunk["content"]

FastAPI SSE Integration

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

from agentic_ai_gateway import create_bedrock_gateway

app = FastAPI()
gateway = create_bedrock_gateway()


class QueryRequest(BaseModel):
    prompt: str  # request body: {"prompt": "..."}


@app.post("/api/v1/query/stream")
async def stream_query(request: QueryRequest):
    async def generate():
        # Emit start event
        yield f"data: {json.dumps({'type': 'start'})}\n\n"

        async for chunk in gateway.ainvoke_stream(request.prompt):
            if chunk["type"] == "token":
                yield f"data: {json.dumps({'type': 'token', 'content': chunk.get('content', '')})}\n\n"
            elif chunk["type"] == "done":
                yield f"data: {json.dumps({'type': 'done', 'model_used': chunk.get('model_used', 'unknown'), 'fallback_used': chunk.get('fallback_used', False)})}\n\n"
            elif chunk["type"] == "error":
                # Forward mid-stream errors to the client instead of dropping them
                yield f"data: {json.dumps({'type': 'error', 'error': chunk.get('error', '')})}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
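On the wire, each event in the endpoint above is a `data:` line followed by a blank line. The sketch below shows that framing and a minimal client-side parser for it. It assumes every payload is single-line JSON (which `json.dumps` produces by default) and does not implement the full Server-Sent Events format; `sse_frame` and `parse_sse` are hypothetical helpers.

```python
import json
from typing import Iterable, Iterator


def sse_frame(payload: dict) -> str:
    """Serialize one event as a Server-Sent Events frame."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from 'data:' lines, ignoring everything else."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


frames = [
    sse_frame({"type": "start"}),
    sse_frame({"type": "token", "content": "Hel"}),
    sse_frame({"type": "token", "content": "lo"}),
    sse_frame({"type": "done", "model_used": "model-a", "fallback_used": False}),
]
wire = "".join(frames)
events = list(parse_sse(wire.splitlines()))
text = "".join(e["content"] for e in events if e["type"] == "token")
print(text)
```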

Streaming Event Types

The streaming API yields dictionaries with the following types:

Event Type	Description	Fields
`start`	Stream started	`model_used`, `fallback_used`, `canary_used`
`token`	Content token	`content` (the token text)
`done`	Stream complete	`model_used`, `latency_ms`, `input_tokens`, `output_tokens`, `fallback_used`
`error`	Error occurred	`error` (error message)
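A consumer that only needs the final text can fold these events into a single result. This is a sketch built on the field names in the table; `collect_stream` is a hypothetical helper and the events are hand-written rather than captured from a real stream.

```python
def collect_stream(events):
    """Fold streaming events into one result dict; raise on an 'error' event."""
    result = {"content": "", "model_used": None, "fallback_used": False}
    for event in events:
        kind = event["type"]
        if kind == "start":
            result["model_used"] = event.get("model_used")
            result["fallback_used"] = event.get("fallback_used", False)
        elif kind == "token":
            result["content"] += event.get("content", "")
        elif kind == "done":
            result["latency_ms"] = event.get("latency_ms")
            result["output_tokens"] = event.get("output_tokens")
        elif kind == "error":
            raise RuntimeError(event.get("error", "unknown streaming error"))
    return result


# Hand-written events using the fields from the table above.
fake_events = [
    {"type": "start", "model_used": "model-a", "fallback_used": False, "canary_used": False},
    {"type": "token", "content": "Once "},
    {"type": "token", "content": "upon a time"},
    {"type": "done", "model_used": "model-a", "latency_ms": 840,
     "input_tokens": 5, "output_tokens": 4, "fallback_used": False},
]
final = collect_stream(fake_events)
print(final["content"])
```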

Streaming with Fallback

Streaming includes automatic fallback support. If the primary model fails before streaming begins, the gateway automatically tries fallback models:

# If Claude Sonnet fails during connection, automatically tries Haiku
for chunk in gateway.invoke_stream("Hello"):
    if chunk["type"] == "start":
        if chunk["fallback_used"]:
            print(f"⚠️ Using fallback model: {chunk['model_used']}")
    # ... handle other events

Note: Once streaming has started successfully, if an error occurs mid-stream, the gateway will emit an error event rather than attempting fallback (since partial content has already been delivered).
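The rule described in the note (fall back only before the first token, report errors in-band afterwards) can be sketched as a generator. This illustrates the rule and is not the package's code: `stream_with_fallback`, `open_stream`, and the two stub streams are hypothetical, and the sketch emits `start` when the first token arrives, which may differ from the real timing.

```python
from typing import Callable, Iterator, Sequence


def stream_with_fallback(
    prompt: str,
    models: Sequence[str],
    open_stream: Callable[[str, str], Iterator[str]],
) -> Iterator[dict]:
    """Fall back only until the first token; afterwards report errors in-band."""
    last_error = None
    for index, model_id in enumerate(models):
        start = {"type": "start", "model_used": model_id, "fallback_used": index > 0}
        started = False
        try:
            for token in open_stream(model_id, prompt):
                if not started:
                    started = True
                    yield start
                yield {"type": "token", "content": token}
        except Exception as exc:
            if started:
                # Partial content was already delivered, so do not switch models.
                yield {"type": "error", "error": str(exc)}
                return
            last_error = exc
            continue  # nothing was sent yet: try the next model
        if not started:
            yield start  # the stream was empty but opened successfully
        yield {"type": "done", "model_used": model_id, "fallback_used": index > 0}
        return
    yield {"type": "error", "error": f"all models failed: {last_error}"}


def fails_on_connect(model_id, prompt):
    if model_id == "primary-model":
        raise ConnectionError("simulated connect failure")
    yield from ["Hi", " there"]


def fails_mid_stream(model_id, prompt):
    yield "partial"
    raise RuntimeError("connection dropped")


models = ["primary-model", "backup-model"]
before = list(stream_with_fallback("Hello", models, fails_on_connect))
during = list(stream_with_fallback("Hello", models, fails_mid_stream))
print([e["type"] for e in before])
print([e["type"] for e in during])
```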

Examples

See the examples/ directory for complete integration examples:

bedrock_example.py - Basic Bedrock usage with fallbacks and canary
multiagent_example.py - Multi-agent tool calling with agentic loop
rag_example.py - RAG pipeline integration
streaming_example.py - Real-time token streaming with SSE

Why Not Just Use...

Approach	Limitation
Direct API calls	No fallback, crashes on errors
Try/except wrapper	Manual, error-prone, no canary
API Gateway (AWS)	Doesn't understand LLM-specific routing
SageMaker endpoints	Overkill for routing, designed for hosting

Agentic AI Gateway is purpose-built for LLM routing:

Model-aware fallback chains
Canary deployments with gradual rollout
Multi-provider support (Bedrock + OpenAI + custom)
Zero infrastructure (it's just Python code)

Author

Tyler Canton - AI/ML Engineer specializing in production LLM systems

GitHub: @tyler-canton
PyPI: agentic-ai-gateway

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

See LICENSE for details.

Release history

Version	Release date
0.6.0	Apr 10, 2026
0.5.0	Mar 31, 2026
0.4.0 (this version)	Mar 28, 2026
0.3.0	Mar 28, 2026
0.2.1	Mar 14, 2026
0.2.0	Mar 14, 2026
0.1.0	Mar 11, 2026

Download files

Source Distribution

agentic_ai_gateway-0.4.0.tar.gz (6.2 MB), uploaded Mar 28, 2026, Source

Built Distribution

agentic_ai_gateway-0.4.0-py3-none-any.whl (41.5 kB), uploaded Mar 28, 2026, Python 3

File details

Details for the file agentic_ai_gateway-0.4.0.tar.gz.

File metadata

Download URL: agentic_ai_gateway-0.4.0.tar.gz
Upload date: Mar 28, 2026
Size: 6.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for agentic_ai_gateway-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`e3cf056c13725ad4620eb24bf102bb455a198d312b86a32c5a056a69fec29151`
MD5	`b3a95bc46f21bdae73f43452b27a6653`
BLAKE2b-256	`dc52fd19fd88cffcf49c1e88ab499a1e8a47756385c6faefead629d0d0f69a62`
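A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the demo hashes a temporary file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare case-insensitively against a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file; the expected digest is computed from the same bytes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    ok = matches_published_hash(sample, expected.upper())
    bad = matches_published_hash(sample, "0" * 64)
print(ok, bad)
```

For the real file, the path would point at the downloaded `agentic_ai_gateway-0.4.0.tar.gz` and the expected value would be the SHA256 digest from the table above.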

File details

Details for the file agentic_ai_gateway-0.4.0-py3-none-any.whl.

File metadata

Download URL: agentic_ai_gateway-0.4.0-py3-none-any.whl
Upload date: Mar 28, 2026
Size: 41.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for agentic_ai_gateway-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ef34938b0b5604317b56b98d56af881b1804aa2282fce2ffa3b656d29a6d2a6e`
MD5	`9441c2c541b92d49b948c5a24b338a19`
BLAKE2b-256	`cef967fcf1204d4124645a80165dbe9164fe3873493f3839224f5c261b0b4547`
