Intelligent AI proxy with multi-provider routing, semantic caching, and delta context buffers

AI Proxy — Intelligent Multi-Provider LLM Gateway

A local proxy that combines 10 providers, 15 models, NVIDIA Jetson Orin, and a delta context buffer behind a single OpenAI-compatible API. Budget: $20–60/month instead of $150+.

┌─────────────────────────────────────────────────┐
│  IDE (Roo Code / Cline / Continue.dev / Aider)  │
│           ↓ localhost:4000                      │
├─────────────────────────────────────────────────┤
│               AI Proxy (FastAPI)                │
│  ┌─────────┐ ┌────────────┐ ┌───────────────┐   │
│  │Analyzer │→│  Router    │→│  LiteLLM      │   │
│  │(tier+   │ │(cost+      │ │(10 providers  │   │
│  │ caps)   │ │ fallbacks) │ │ 15 models)    │   │
│  └─────────┘ └────────────┘ └───────────────┘   │
│       ↑            ↑              ↑             │
│  Delta Buffer  Redis Cache  Budget Ledger       │
├─────────────────────────────────────────────────┤
│  Ollama (Jetson Orin / GPU / CPU)               │
└─────────────────────────────────────────────────┘

Features

  • Content-based routing — analyzes your prompt to pick the cheapest model that can handle the task (Opus 4.6 for architecture, Haiku 4.5 for typos)
  • 10 providers, 15 models — Anthropic, OpenAI, Google, DeepSeek, Groq, OpenRouter, Mistral, Together, Fireworks, Cerebras + local Ollama
  • Delta context buffer — watches code2llm output and sends only file diffs, not full context (saves 60–80% tokens)
  • Budget enforcement — daily/monthly USD limits with per-request caps
  • Fallback chains — if Anthropic is rate-limited, auto-fallback to OpenAI → DeepSeek → local (see the sketch after this list)
  • OpenAI-compatible API — drop-in replacement for any tool expecting OpenAI format
  • Docker + Podman + Quadlet — development, production, and systemd-native deployments
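
The fallback behavior can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and chain shape are assumptions, and the real chain lives in the router, not here.

from typing import Callable

# Illustrative fallback loop: try each provider in order and record
# how far down the chain the request landed (cf. fallback_index in
# the _proxy response metadata).
def complete_with_fallbacks(request: dict,
                            chain: list[Callable[[dict], dict]]) -> dict:
    last_error: Exception | None = None
    for index, call_provider in enumerate(chain):
        try:
            response = call_provider(request)
            response.setdefault("_proxy", {})["fallback_index"] = index
            return response
        except Exception as err:  # rate limit, timeout, 5xx, ...
            last_error = err
    raise RuntimeError("all providers in the chain failed") from last_error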

Quick Start

Option A: Local Python

git clone https://github.com/wronai/ai-proxy && cd ai-proxy
bash scripts/setup.sh
# Edit .env with your API keys
ai-proxy

Option B: Docker Compose

cp .env.example .env
# Edit .env with your API keys
docker compose up -d

Option C: Jetson Orin

docker build -f Dockerfile.jetson -t ai-proxy:jetson .
docker run --runtime nvidia --gpus all \
  -p 4000:4000 -p 11434:11434 \
  --env-file .env \
  ai-proxy:jetson

Test it

curl http://localhost:4000/health

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-proxy-local-dev" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "balanced",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
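
The same request from Python, using the official openai package pointed at the proxy (a minimal sketch; base URL and key mirror the curl call above):

from openai import OpenAI

# Any OpenAI-compatible client works; only the base URL and key change.
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-proxy-local-dev",
)

response = client.chat.completions.create(
    model="balanced",  # alias resolved by the proxy's router
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)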

Model Routing Strategy

The proxy analyzes each prompt and picks the optimal model:

| Task Type | Tier | Model Selected | Cost/1M tokens |
|-----------|------|----------------|----------------|
| "Fix this typo" | trivial | Cerebras Llama 70B / DeepSeek V3 | $0.27–$0.60 |
| "What does this function do?" | operational | Haiku 4.5 / Gemini Flash | $0.15–$1.00 |
| "Implement a REST endpoint" | standard | Sonnet 4.6 / GPT-4.1 | $2.00–$3.00 |
| "Refactor auth across 20 files" | complex | Sonnet 4.6 / Gemini Pro | $3.00–$10.00 |
| "Debug this race condition step by step" | deep | Opus 4.6 / DeepSeek R1 | $0.55–$5.00 |

Model Aliases

Use these as the model parameter for explicit routing:

| Alias | Routes To | When to Use |
|-------|-----------|-------------|
| cheap | Haiku 4.5 | Debug, validation, simple Q&A |
| balanced | Sonnet 4.6 | Default coding, implementation |
| premium | Opus 4.6 | Complex refactoring, architecture |
| free | Gemini 2.5 Flash | Planning, analysis (free tier) |
| local | Qwen 3B (Ollama) | Offline, privacy, autocomplete |

Automatic Routing

Without a model alias, the proxy analyzes your message:

# Automatically routes to cheap model
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -d '{"messages": [{"role": "user", "content": "What is a for loop?"}]}'

# Automatically routes to premium model
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -d '{"messages": [{"role": "user", "content": "Refactor the entire auth module to microservices"}]}'

# Force a tier with header
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "X-Task-Tier: deep" \
  -d '{"messages": [{"role": "user", "content": "Why does this deadlock?"}]}'

Delta Context Buffer

The proxy maintains a buffer of your project files (from code2llm output) and sends only diffs to the LLM, dramatically reducing token usage.

Setup

# Terminal 1: Generate code2llm output
pip install code2llm
code2llm ./ -f all -o ./project --no-chunk

# Terminal 2: Start the watcher
ai-proxy-client --watch ./project --proxy http://localhost:4000

# Terminal 3: Query with context injection
curl -X POST localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "X-Inject-Context: true" \
  -d '{"messages": [{"role": "user", "content": "Explain the auth module"}]}'

How It Works

  1. code2llm generates project analysis files in ./project/
  2. ai-proxy-client watches the directory with watchfiles
  3. On change, it computes a unified diff against the last-sent snapshot (see the sketch below)
  4. Only changed portions are sent to the proxy as a <context_delta> block
  5. When you add X-Inject-Context: true, the delta is injected into the system prompt

Before (full context every request): ~120K tokens × $3/1M = $0.36/request
After (delta only): ~5K tokens × $3/1M = $0.015/request
→ 96% savings
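
Step 3 above can be pictured with Python's standard difflib. An illustrative sketch, not the watcher's actual code; the function name is made up:

import difflib

def compute_delta(last_sent: str, current: str, path: str) -> str:
    """Unified diff between the last-sent snapshot and the current file."""
    return "".join(difflib.unified_diff(
        last_sent.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))

# Only the changed lines travel to the proxy, wrapped in <context_delta>.
delta = compute_delta("def auth():\n    pass\n",
                      "def auth():\n    return check_token()\n",
                      "auth.py")
print(f"<context_delta>\n{delta}</context_delta>")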

IDE Integration

Roo Code

Settings → Provider: OpenAI Compatible

  • API Base: http://localhost:4000
  • API Key: (your master key from .env)
  • Sticky Models per mode:
    • Architect → free
    • Code → balanced
    • Debug → cheap
    • Custom Opus → premium

Cline / Continue.dev / Aider

Same pattern: point API base to http://localhost:4000 with your master key.

Deployment

Docker Compose (Development)

docker compose up -d          # proxy + redis + ollama
docker compose logs -f proxy  # watch logs

Docker Compose + Traefik (Production)

docker compose -f docker-compose.prod.yml up -d
# Access via https://ai-proxy.local

Podman Quadlet (Systemd-native)

# Copy quadlet files
mkdir -p ~/.config/containers/systemd
cp quadlet/*.container quadlet/*.network ~/.config/containers/systemd/

# Build and tag image
podman build -t localhost/ai-proxy:latest .

# Create config dir
mkdir -p ~/.config/ai-proxy
cp .env ~/.config/ai-proxy/.env

# Enable and start
systemctl --user daemon-reload
systemctl --user start ai-proxy
systemctl --user status ai-proxy

Jetson Orin

The Jetson Dockerfile bundles Ollama + AI Proxy in a single container:

docker build -f Dockerfile.jetson -t ai-proxy:jetson .
docker run --runtime nvidia --gpus all \
  -p 4000:4000 -p 11434:11434 \
  -v ~/ollama:/root/.ollama \
  --env-file .env \
  ai-proxy:jetson

Models available on Jetson Orin 8GB:

  • qwen2.5-coder:1.5b — autocomplete (~1GB, ~30 tok/s)
  • qwen2.5-coder:3b — code generation (~2GB, ~18 tok/s)
  • phi3:3.8b — general tasks (~2.5GB, ~15 tok/s)

API Reference

POST /v1/chat/completions

OpenAI-compatible. Extra features:

| Header | Description |
|------------------|-------------|
| X-Task-Tier | Force a tier: trivial, operational, standard, complex, or deep |
| X-Inject-Context | Set to true to inject the latest code2llm delta |

Response includes _proxy metadata:

{
  "choices": [...],
  "_proxy": {
    "model_id": "anthropic/sonnet-4.6",
    "tier": "standard",
    "cost_usd": 0.000045,
    "routing_reason": "tier=standard, cost=$0.0000",
    "elapsed_ms": 1234.5,
    "fallback_index": 0
  }
}
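
Per-request spend can be tracked from a script by reading this block off the raw JSON. A small sketch using the requests library; the field names match the example above, and the dev key is the one from Test it:

import requests

resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-proxy-local-dev"},
    json={"model": "balanced",
          "messages": [{"role": "user", "content": "Hello!"}]},
).json()

meta = resp["_proxy"]
print(f"{meta['model_id']} tier={meta['tier']} "
      f"cost=${meta['cost_usd']:.6f} ({meta['elapsed_ms']:.0f} ms)")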

GET /v1/models

List all available models with pricing and capabilities.

GET /v1/budget

Current spend vs. limits.

POST /v1/context/delta

Receive context delta from the watcher client.

GET /v1/context/stats

Delta buffer statistics.

Testing

# Unit tests (no external services needed)
pytest tests/ --ignore=tests/test_e2e.py -v

# E2E tests (mock LiteLLM, no real API calls)
pytest tests/test_e2e.py -v -m e2e

# All tests with coverage
pytest tests/ -v --cov=ai_proxy --cov-report=html

Budget Examples

$25/month (casual, 4h/day)

DAILY_BUDGET_USD=1.5
MONTHLY_BUDGET_USD=25

  • Autocomplete: local Ollama ($0)
  • Planning: Gemini free tier ($0)
  • Coding: Sonnet 4.6 ($15)
  • Complex: skip Opus, use DeepSeek R1 ($5)

$60/month (intensive, 8h/day)

DAILY_BUDGET_USD=3.0
MONTHLY_BUDGET_USD=60

Full model spectrum with Opus 4.6 for 2–3 complex tasks/week.
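
The enforcement behind these limits can be pictured as a ledger consulted before every request. An illustrative sketch; the class and field names are assumptions, not the actual code in router/strategy.py:

from dataclasses import dataclass

@dataclass
class BudgetLedger:
    daily_limit_usd: float     # DAILY_BUDGET_USD
    monthly_limit_usd: float   # MONTHLY_BUDGET_USD
    spent_today: float = 0.0
    spent_month: float = 0.0

    def can_spend(self, estimated_cost: float) -> bool:
        """Reject a request that would cross either limit."""
        return (self.spent_today + estimated_cost <= self.daily_limit_usd
                and self.spent_month + estimated_cost <= self.monthly_limit_usd)

    def record(self, actual_cost: float) -> None:
        self.spent_today += actual_cost
        self.spent_month += actual_cost

ledger = BudgetLedger(daily_limit_usd=3.0, monthly_limit_usd=60.0)
if ledger.can_spend(0.02):
    ledger.record(0.02)  # update with the actual cost after the call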

Project Structure

ai-proxy/
├── src/ai_proxy/
│   ├── main.py               # FastAPI app + OpenAI-compatible endpoint
│   ├── config.py             # Pydantic settings from .env
│   ├── providers/__init__.py # Model registry (15 models, 10 providers)
│   ├── router/
│   │   ├── __init__.py       # Content analyzer (tier classification)
│   │   └── strategy.py       # Router + cost ledger + fallbacks
│   ├── cache/__init__.py     # Delta context buffer
│   ├── middleware/__init__.py # Auth + cost tracking
│   └── watch/__init__.py     # File watcher client
├── tests/
│   ├── test_analyzer.py      # 20+ tier classification tests
│   ├── test_router.py        # Router strategy + budget tests
│   ├── test_delta_buffer.py  # Delta computation tests
│   └── test_e2e.py           # Full HTTP API tests
├── docker-compose.yml        # Development (proxy + redis + ollama)
├── docker-compose.prod.yml   # Production (+ traefik)
├── Dockerfile                # Standard build
├── Dockerfile.jetson         # Jetson Orin (ARM64 + CUDA)
├── quadlet/                  # Podman systemd integration
├── traefik/                  # Reverse proxy config
└── scripts/
    ├── setup.sh              # First-time setup
    └── jetson-entrypoint.sh  # Jetson startup script

License

Apache License 2.0 - see LICENSE for details.

Author

Created by Tom Sapletta - tom@sapletta.com
