Intelligent AI proxy with multi-provider routing, semantic caching, and delta context buffers
AI Proxy — Intelligent Multi-Provider LLM Gateway
A local proxy that unifies 10 providers, 15 models, NVIDIA Jetson Orin, and a delta context buffer behind a single OpenAI-compatible API. Budget of $20–60/month instead of $150+.
┌─────────────────────────────────────────────────┐
│ IDE (Roo Code / Cline / Continue.dev / Aider) │
│ ↓ localhost:4000 │
├─────────────────────────────────────────────────┤
│ AI Proxy (FastAPI) │
│ ┌─────────┐ ┌────────────┐ ┌───────────────┐ │
│ │Analyzer │→│ Router │→│ LiteLLM │ │
│ │(tier+ │ │(cost+ │ │(10 providers │ │
│ │ caps) │ │ fallbacks) │ │ 15 models) │ │
│ └─────────┘ └────────────┘ └───────────────┘ │
│ ↑ ↑ ↑ │
│ Delta Buffer Redis Cache Budget Ledger │
├─────────────────────────────────────────────────┤
│ Ollama (Jetson Orin / GPU / CPU) │
└─────────────────────────────────────────────────┘
Features
- Content-based routing — analyzes your prompt to pick the cheapest model that can handle the task (Opus 4.6 for architecture, Haiku 4.5 for typos)
- 10 providers, 15 models — Anthropic, OpenAI, Google, DeepSeek, Groq, OpenRouter, Mistral, Together, Fireworks, Cerebras + local Ollama
- Delta context buffer — watches code2llm output and sends only file diffs, not full context (saves 60–80% of tokens)
- Budget enforcement — daily/monthly USD limits with per-request caps
- Fallback chains — if Anthropic is rate-limited, auto-fallback to OpenAI → DeepSeek → local (sketched after this list)
- OpenAI-compatible API — drop-in replacement for any tool expecting OpenAI format
- Docker + Podman + Quadlet — development, production, and systemd-native deployments
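The fallback behavior follows a simple try-next pattern. Below is a minimal, hypothetical Python sketch of that pattern; it is not the actual router code (real calls go through LiteLLM), and call_with_fallbacks plus the model ids are illustrative names:

# Hypothetical sketch of a fallback chain, not the actual router implementation.
from typing import Callable

def call_with_fallbacks(chain: list[str], send: Callable[[str, str], str], prompt: str) -> tuple[str, int]:
    """Try each model in order; return (response, fallback_index) of the first success."""
    last_error = None
    for index, model in enumerate(chain):
        try:
            return send(model, prompt), index  # e.g. a LiteLLM completion keyed by model id
        except Exception as error:  # rate limit, timeout, provider outage, ...
            last_error = error
    raise RuntimeError(f"all providers in chain failed: {chain}") from last_error

# Illustrative chain mirroring the README: Anthropic, then OpenAI, then DeepSeek, then local Ollama.
CHAIN = ["anthropic/sonnet-4.6", "openai/gpt-4.1", "deepseek/deepseek-chat", "ollama/qwen2.5-coder:3b"]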
Quick Start
Option A: Local Python
git clone https://github.com/wronai/ai-proxy && cd ai-proxy
bash scripts/setup.sh
# Edit .env with your API keys
ai-proxy
Option B: Docker Compose
cp .env.example .env
# Edit .env with your API keys
docker compose up -d
Option C: Jetson Orin
docker build -f Dockerfile.jetson -t ai-proxy:jetson .
docker run --runtime nvidia --gpus all \
-p 4000:4000 -p 11434:11434 \
--env-file .env \
ai-proxy:jetson
Test it
curl http://localhost:4000/health
curl http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer sk-proxy-local-dev" \
-H "Content-Type: application/json" \
-d '{
"model": "balanced",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Model Routing Strategy
The proxy analyzes each prompt and picks the optimal model:
| Task Type | Tier | Model Selected | Cost/1M tokens |
|---|---|---|---|
| "Fix this typo" | trivial | Cerebras Llama 70B / DeepSeek V3 | $0.27–$0.60 |
| "What does this function do?" | operational | Haiku 4.5 / Gemini Flash | $0.15–$1.00 |
| "Implement a REST endpoint" | standard | Sonnet 4.6 / GPT-4.1 | $2.00–$3.00 |
| "Refactor auth across 20 files" | complex | Sonnet 4.6 / Gemini Pro | $3.00–$10.00 |
| "Debug this race condition step by step" | deep | Opus 4.6 / DeepSeek R1 | $0.55–$5.00 |
Model Aliases
Use these as the model parameter for explicit routing:
| Alias | Routes To | When to Use |
|---|---|---|
| cheap | Haiku 4.5 | Debug, validation, simple Q&A |
| balanced | Sonnet 4.6 | Default coding, implementation |
| premium | Opus 4.6 | Complex refactoring, architecture |
| free | Gemini 2.5 Flash | Planning, analysis (free tier) |
| local | Qwen 3B (Ollama) | Offline, privacy, autocomplete |
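Because the proxy speaks the OpenAI protocol, any OpenAI client can pass an alias as the model. For example, with the official openai Python SDK and the dev key from the Quick Start (adjust both to your setup):

from openai import OpenAI

# Point the standard OpenAI client at the local proxy.
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-proxy-local-dev")

response = client.chat.completions.create(
    model="balanced",  # alias from the table above; routes to Sonnet 4.6
    messages=[{"role": "user", "content": "Implement a REST endpoint"}],
)
print(response.choices[0].message.content)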
Automatic Routing
Without a model alias, the proxy analyzes your message:
# Automatically routes to a cheap model
curl -X POST localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $KEY" \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is a for loop?"}]}'
# Automatically routes to a premium model
curl -X POST localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $KEY" \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Refactor the entire auth module to microservices"}]}'
# Force a tier with a header
curl -X POST localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $KEY" \
-H "Content-Type: application/json" \
-H "X-Task-Tier: deep" \
-d '{"messages": [{"role": "user", "content": "Why does this deadlock?"}]}'
Delta Context Buffer
The proxy maintains a buffer of your project files (from code2llm output) and sends only diffs to the LLM, dramatically reducing token usage.
Setup
# Terminal 1: Generate code2llm output
pip install code2llm
code2llm ./ -f all -o ./project --no-chunk
# Terminal 2: Start the watcher
ai-proxy-client --watch ./project --proxy http://localhost:4000
# Terminal 3: Query with context injection
curl -X POST localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $KEY" \
-H "Content-Type: application/json" \
-H "X-Inject-Context: true" \
-d '{"messages": [{"role": "user", "content": "Explain the auth module"}]}'
How It Works
- code2llm generates project analysis files in ./project/
- ai-proxy-client watches the directory with watchfiles
- On change, it computes a unified diff against the last-sent snapshot
- Only changed portions are sent to the proxy as a <context_delta> block
- When you add X-Inject-Context: true, the delta is injected into the system prompt
Before (full context every request): ~120K tokens × $3/1M ≈ $0.36/request. After (delta only): ~5K tokens × $3/1M ≈ $0.015/request → ~96% savings.
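The diff computation itself is ordinary unified diffing. Here is a minimal sketch with Python's standard difflib, assuming the client keeps the last-sent snapshot in memory (the real watcher in src/ai_proxy/watch/ may differ in detail):

import difflib

def context_delta(last_snapshot: str, current: str, filename: str) -> str:
    """Unified diff between the last-sent snapshot and the current file contents."""
    diff = difflib.unified_diff(
        last_snapshot.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff)  # empty string means no change, so nothing is sent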
IDE Integration
Roo Code
Settings → Provider: OpenAI Compatible
- API Base: http://localhost:4000
- API Key: (your master key from .env)
- Sticky Models per mode:
  - Architect → free
  - Code → balanced
  - Debug → cheap
  - Custom Opus → premium
Cline / Continue.dev / Aider
Same pattern: point API base to http://localhost:4000 with your master key.
Deployment
Docker Compose (Development)
docker compose up -d # proxy + redis + ollama
docker compose logs -f proxy # watch logs
Docker Compose + Traefik (Production)
docker compose -f docker-compose.prod.yml up -d
# Access via https://ai-proxy.local
Podman Quadlet (Systemd-native)
# Copy quadlet files
mkdir -p ~/.config/containers/systemd
cp quadlet/*.container quadlet/*.network ~/.config/containers/systemd/
# Build and tag image
podman build -t localhost/ai-proxy:latest .
# Create config dir
mkdir -p ~/.config/ai-proxy
cp .env ~/.config/ai-proxy/.env
# Enable and start
systemctl --user daemon-reload
systemctl --user start ai-proxy
systemctl --user status ai-proxy
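For reference, a quadlet container unit for this service generally looks like the sketch below. The field values are illustrative; the authoritative files ship in quadlet/:

# ~/.config/containers/systemd/ai-proxy.container (illustrative sketch)
[Unit]
Description=AI Proxy intelligent LLM gateway

[Container]
Image=localhost/ai-proxy:latest
PublishPort=4000:4000
EnvironmentFile=%h/.config/ai-proxy/.env

[Install]
WantedBy=default.target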
Jetson Orin
The Jetson Dockerfile bundles Ollama + AI Proxy in a single container:
docker build -f Dockerfile.jetson -t ai-proxy:jetson .
docker run --runtime nvidia --gpus all \
-p 4000:4000 -p 11434:11434 \
-v ~/ollama:/root/.ollama \
--env-file .env \
ai-proxy:jetson
Models available on Jetson Orin 8GB:
- qwen2.5-coder:1.5b — autocomplete (~1 GB, ~30 tok/s)
- qwen2.5-coder:3b — code generation (~2 GB, ~18 tok/s)
- phi3:3.8b — general tasks (~2.5 GB, ~15 tok/s)
API Reference
POST /v1/chat/completions
OpenAI-compatible. Extra features:
| Header | Description |
|---|---|
| X-Task-Tier | Force a tier: trivial, operational, standard, complex, or deep |
| X-Inject-Context | Set to true to inject the latest code2llm delta |
Response includes _proxy metadata:
{
"choices": [...],
"_proxy": {
"model_id": "anthropic/sonnet-4.6",
"tier": "standard",
"cost_usd": 0.000045,
"routing_reason": "tier=standard, cost=$0.0000",
"elapsed_ms": 1234.5,
"fallback_index": 0
}
}
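To inspect the routing decision programmatically, read the _proxy block from the response. A short sketch with requests, reusing the dev key from the Quick Start:

import requests

# Request a completion, then print the proxy's routing metadata.
resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-proxy-local-dev"},
    json={"model": "balanced", "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=60,
)
meta = resp.json()["_proxy"]
print(f"routed to {meta['model_id']} (tier={meta['tier']}), cost ${meta['cost_usd']:.6f}")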
GET /v1/models
List all available models with pricing and capabilities.
GET /v1/budget
Current spend vs. limits.
POST /v1/context/delta
Receive context delta from the watcher client.
GET /v1/context/stats
Delta buffer statistics.
Testing
# Unit tests (no external services needed)
pytest tests/ --ignore=tests/test_e2e.py -v
# E2E tests (mock LiteLLM, no real API calls)
pytest tests/test_e2e.py -v -m e2e
# All tests with coverage
pytest tests/ -v --cov=ai_proxy --cov-report=html
Budget Examples
$25/month (casual, 4h/day)
DAILY_BUDGET_USD=1.5
MONTHLY_BUDGET_USD=25
Autocomplete: Ollama local ($0) → Planning: Gemini free ($0) → Coding: Sonnet 4.6 ($15) → Complex: skip Opus, use DeepSeek R1 ($5)
$60/month (intensive, 8h/day)
DAILY_BUDGET_USD=3.0
MONTHLY_BUDGET_USD=60
Full model spectrum with Opus 4.6 for 2–3 complex tasks/week.
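With either budget, current spend can be checked against the limits via GET /v1/budget. A short requests sketch (the response schema is not documented here, so the JSON is printed raw):

import requests

# Query the budget endpoint; prints current spend vs. the .env limits.
resp = requests.get(
    "http://localhost:4000/v1/budget",
    headers={"Authorization": "Bearer sk-proxy-local-dev"},
    timeout=10,
)
print(resp.json())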
Project Structure
ai-proxy/
├── src/ai_proxy/
│ ├── main.py # FastAPI app + OpenAI-compatible endpoint
│ ├── config.py # Pydantic settings from .env
│ ├── providers/__init__.py # Model registry (15 models, 10 providers)
│ ├── router/
│ │ ├── __init__.py # Content analyzer (tier classification)
│ │ └── strategy.py # Router + cost ledger + fallbacks
│ ├── cache/__init__.py # Delta context buffer
│ ├── middleware/__init__.py # Auth + cost tracking
│ └── watch/__init__.py # File watcher client
├── tests/
│ ├── test_analyzer.py # 20+ tier classification tests
│ ├── test_router.py # Router strategy + budget tests
│ ├── test_delta_buffer.py # Delta computation tests
│ └── test_e2e.py # Full HTTP API tests
├── docker-compose.yml # Development (proxy + redis + ollama)
├── docker-compose.prod.yml # Production (+ traefik)
├── Dockerfile # Standard build
├── Dockerfile.jetson # Jetson Orin (ARM64 + CUDA)
├── quadlet/ # Podman systemd integration
├── traefik/ # Reverse proxy config
└── scripts/
├── setup.sh # First-time setup
└── jetson-entrypoint.sh # Jetson startup script
License
Apache License 2.0 - see LICENSE for details.
Author
Created by Tom Sapletta - tom@sapletta.com