The compiler for agentic systems. Route every query to the optimal model.
FluxCompute
The compiler for agentic systems.
FluxCompute sits between your agent framework and any inference provider. It classifies every step of an agent loop in ~12 ms, routes it to the cheapest model that can handle it correctly, and recalibrates its routing thresholds as accuracy measurements accumulate.
60–70% inference cost reduction · <1% accuracy delta · zero code changes
How it works
A single agent request can expand into a chain of 50+ model calls. Most teams send every step to a top-tier model, including trivial ones such as formatting a JSON tool call, which a 1B-parameter model handles for a fraction of a cent.
FluxCompute intercepts each step and routes it:
| Tier | Model | Price (per 1M tokens) | When |
|---|---|---|---|
| Easy | Claude Haiku / GPT-4o-mini | $0.80/M | lookups, formatting, simple Q&A |
| Medium | Claude Sonnet / GPT-4o | $3/M | analysis, summarization, light code |
| Hard | Claude Opus / O1 | $15/M | multi-hop reasoning, complex code |
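The tier table can be read as a threshold lookup on the classifier's difficulty score. The following is a minimal sketch, assuming a score in [0, 1] and two cut-offs. The names `Tier`, `TIERS`, and `pick_tier`, and the threshold values 0.33 and 0.66, are illustrative assumptions and are not taken from FluxCompute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    label: str
    price_per_mtok: float  # USD per 1M tokens, copied from the tier table


# Ordered cheapest first. The upper bounds are hypothetical thresholds.
TIERS = [
    (0.33, Tier("easy", 0.80)),
    (0.66, Tier("medium", 3.00)),
    (1.00, Tier("hard", 15.00)),
]


def pick_tier(score: float) -> Tier:
    """Return the cheapest tier whose upper bound covers the score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    for upper, tier in TIERS:
        if score <= upper:
            return tier
    return TIERS[-1][1]


print(pick_tier(0.12).label)  # prints: easy
```

In the real system the thresholds are per-customer and calibrated by L3, so fixed constants like these would only be a starting point.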
Architecture: Five Layers
YOUR AGENT
│
▼
┌─────────────────────────────────────────────────────────┐
│ L0 KV Cache Persistence │
│ Redis-backed session store · prompt-cache markers │
│ Anthropic cache reads: 90% cheaper than fresh │
├─────────────────────────────────────────────────────────┤
│ L1 Query Classifier │
│ 7-signal heuristic · ~12 ms · no network call │
│ Per-customer thresholds calibrated by L3 │
├─────────────────────────────────────────────────────────┤
│ L2 Model Executor + Context Handoff │
│ Retry escalation: Haiku → Sonnet → Opus │
│ ContextBuilder: smart compression by difficulty │
│ CacheManager: cache_control markers for Anthropic │
├─────────────────────────────────────────────────────────┤
│ L3 Drift Monitor │
│ AccuracyOracle: 5% shadow sample, Haiku-as-judge │
│ KL divergence on difficulty distribution │
│ Auto-recompile: threshold calibration from data │
├─────────────────────────────────────────────────────────┤
│ L4 Observability │
│ Streamlit dashboard · Prometheus /metrics │
│ PostgreSQL query log · per-customer accuracy │
└─────────────────────────────────────────────────────────┘
│
▼
ANY PROVIDER (Anthropic · OpenAI · local weights)
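The L2 retry escalation (Haiku → Sonnet → Opus) can be sketched as a loop over a model ladder. This is an illustration under assumptions: `call_model` is a hypothetical callable that raises on failure, and the source does not say what FluxCompute treats as a failed attempt.

```python
from typing import Callable, Sequence

ESCALATION = ("haiku", "sonnet", "opus")  # order taken from the diagram


def run_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    start: str = "haiku",
    ladder: Sequence[str] = ESCALATION,
) -> tuple:
    """Try `start`, then each stronger model, until one call succeeds.

    `call_model(model, prompt)` is a hypothetical callable that raises on
    failure. Returns (model_that_answered, answer).
    """
    last_error = None
    for model in ladder[ladder.index(start):]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise RuntimeError("all tiers failed") from last_error


# Stand-in backend for illustration: the cheapest model always fails.
def flaky(model: str, prompt: str) -> str:
    if model == "haiku":
        raise ValueError("malformed output")
    return f"{model}: ok"


print(run_with_escalation("2+2?", flaky))  # ('sonnet', 'sonnet: ok')
```

A split like this would also explain why the response metadata carries both a `model_attempted` and a `model_selected` field, although that reading is an inference from the field names.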
Integration: Two Modes
Mode 1 — Proxy (zero code changes)
Point your existing OpenAI SDK at FluxCompute. Nothing else changes.
import openai

client = openai.OpenAI(
    api_key="flx_your_key",
    base_url="https://api.fluxcompute.dev/v1",
)

response = client.chat.completions.create(
    model="auto",  # FluxCompute decides
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
# Standard OpenAI response + FluxCompute metadata
print(response.choices[0].message.content) # "Paris"
print(response.fluxcompute["model_selected"]) # "claude-3-5-haiku-20241022"
print(response.fluxcompute["savings_usd"]) # 0.0035
Streaming works the same way — just pass stream=True.
Mode 2 — SDK (direct, for maximum control)
import asyncio
from fluxcompute import FluxClient

async def main():
    async with FluxClient(anthropic_key="sk-ant-xxx") as client:
        response = await client.messages.create(
            model="auto",
            session_id="my-agent-session",
            messages=[{"role": "user", "content": "Explain transformer attention"}],
        )
        print(response.text)
        print(response.fluxcompute.difficulty_label)  # "medium"
        print(response.fluxcompute.savings_usd)       # 0.0041
        print(response.fluxcompute.cache.cache_hit)   # True (on repeat turns)

asyncio.run(main())
Install
SDK only:
pip install fluxcompute
Self-hosted proxy server:
pip install "fluxcompute[server]"
Self-hosting
1. Environment
cp .env.example .env
# Fill in: ANTHROPIC_API_KEY, FLUX_API_KEYS, DATABASE_URL
# Optional: REDIS_URL (session persistence across restarts)
2. Database
python scripts/init_db.py
3. Run
uvicorn app.main:app --host 0.0.0.0 --port 8000
4. Dashboard
streamlit run app/dashboard/app.py
Deploy to Railway
railway up
Railway auto-provisions PostgreSQL and Redis if you add those add-ons. Set env vars in the Railway dashboard.
API Reference
Inference
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | OpenAI-compatible routing endpoint |
| GET | /v1/models | List available models |
| GET | /v1/models/{id} | Get a single model |
Request — identical to OpenAI format. Set model: "auto" for automatic routing.
Response — standard OpenAI fields plus:
{
"fluxcompute": {
"difficulty_score": 0.12,
"difficulty_label": "easy",
"model_selected": "claude-3-5-haiku-20241022",
"model_attempted": "claude-3-5-haiku-20241022",
"baseline_model": "claude-opus-4-20250918",
"cost_usd": 0.00000064,
"baseline_cost_usd": 0.0000120,
"savings_usd": 0.0000114,
"savings_pct": 94.7,
"classification_ms": 8.3,
"overhead_ms": 11.2,
"session_id": "fc_a1b2c3d4e5f6",
"context_compression": 0.72,
"cache": {
"cache_write_tokens": 0,
"cache_read_tokens": 1840,
"cache_hit": true
}
}
}
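The savings fields in the example response can be recomputed from the two cost fields. The sketch below assumes that `savings_pct` is expressed relative to `baseline_cost_usd`. The sample numbers are consistent with that reading, but the source does not state the formula. The function name `savings` is illustrative.

```python
import json

# Cost fields copied from the example response.
payload = json.loads("""
{"fluxcompute": {"cost_usd": 0.00000064, "baseline_cost_usd": 0.0000120,
                 "savings_usd": 0.0000114, "savings_pct": 94.7}}
""")


def savings(meta: dict) -> tuple:
    """Return (savings_usd, savings_pct) recomputed from the cost fields."""
    saved = meta["baseline_cost_usd"] - meta["cost_usd"]
    pct = 100.0 * saved / meta["baseline_cost_usd"]
    return saved, pct


saved, pct = savings(payload["fluxcompute"])
print(f"{saved:.7f} USD, {pct:.1f}%")  # prints: 0.0000114 USD, 94.7%
```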
Headers:
- Authorization: Bearer flx_your_key
- X-FluxCompute-Session: session_id (enables multi-turn state tracking)
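A small helper can assemble the two headers for any HTTP client. This is a sketch: `build_headers` is an illustrative name, and the `Content-Type` entry is an assumption based on the JSON request body rather than something the source specifies.

```python
from typing import Dict, Optional


def build_headers(api_key: str, session_id: Optional[str] = None) -> Dict[str, str]:
    """Build request headers; the session header is added only when given."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",  # assumed, since the body is JSON
    }
    if session_id is not None:
        headers["X-FluxCompute-Session"] = session_id
    return headers


print(build_headers("flx_your_key", "fc_a1b2c3d4e5f6"))
```

When using the OpenAI Python SDK in proxy mode, the session header alone can be passed per request through the SDK's `extra_headers` argument, since the SDK already sets the bearer token from `api_key`.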
Metrics
| Method | Path | Description |
|---|---|---|
| GET | /api/metrics/summary?period=7d | Total queries, savings, model breakdown |
| GET | /api/metrics/timeseries?period=30d | Daily cost vs baseline |
| GET | /metrics | Prometheus scrape endpoint |
L3 Drift Monitor
| Method | Path | Description |
|---|---|---|
| GET | /api/drift/status | Accuracy per tier, KL divergence, drift flags |
| POST | /api/drift/recompile | Recalibrate thresholds from measured accuracy |
| GET | /api/drift/accuracy | Oracle measurement history |
| GET | /api/drift/profile | Active routing thresholds for this customer |
Health
| Method | Path | Description |
|---|---|---|
| GET | /health | Service + DB connectivity |
| GET | /docs | Interactive API docs (Swagger) |
L3: The Drift Monitor
This is the moat.
Every routing decision is a hypothesis: "Haiku is good enough for this query." Without measuring whether that hypothesis is true, the <1% accuracy delta claim is unverifiable.
The oracle fixes this:
- For 5% of non-hard requests, the same query is silently sent to Opus in the background
- Haiku judges whether the cheap response was equivalent (equivalent: true/false, confidence: 0.0–1.0)
- Results accumulate in accuracy_measurements
- When accuracy drops below 99% for a tier, or the query distribution shifts (KL divergence > 0.10), POST /api/drift/recompile recalibrates thresholds
- New thresholds take effect on the next request, with no restart
After 30 days of traffic you can prove, per query type, exactly how accurate routing is. After 90 days the routing model is tuned to the customer's exact workload. No competitor starting fresh can replicate this.
# Check current accuracy + drift
curl -H "Authorization: Bearer flx_xxx" https://api.fluxcompute.dev/api/drift/status
# Recalibrate thresholds from measured data
curl -X POST -H "Authorization: Bearer flx_xxx" https://api.fluxcompute.dev/api/drift/recompile
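The two recompile triggers described in this section (per-tier accuracy below 99%, or KL divergence above 0.10 on the difficulty distribution) can be sketched as follows. This is an illustration under assumptions: the direction KL(current || reference), the natural logarithm, the `eps` smoothing, and the names `kl_divergence` and `needs_recompile` are not confirmed by the source.

```python
import math
from typing import Mapping

LABELS = ("easy", "medium", "hard")


def kl_divergence(p: Mapping[str, float], q: Mapping[str, float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) in nats over the three difficulty labels.

    `eps` smooths zero probabilities so the logarithm stays finite.
    """
    total = 0.0
    for label in LABELS:
        pi = p.get(label, 0.0) + eps
        qi = q.get(label, 0.0) + eps
        total += pi * math.log(pi / qi)
    return total


def needs_recompile(tier_accuracy: Mapping[str, float],
                    current: Mapping[str, float],
                    reference: Mapping[str, float],
                    min_accuracy: float = 0.99,
                    max_kl: float = 0.10) -> bool:
    """True if any tier is below min_accuracy or the distribution drifted."""
    accuracy_drop = any(acc < min_accuracy for acc in tier_accuracy.values())
    drifted = kl_divergence(current, reference) > max_kl
    return accuracy_drop or drifted


reference = {"easy": 0.6, "medium": 0.3, "hard": 0.1}
current = {"easy": 0.3, "medium": 0.4, "hard": 0.3}
print(needs_recompile({"easy": 0.995, "medium": 0.992}, current, reference))
```

In this example the accuracies are fine, but the share of hard queries has tripled, so the drift condition alone triggers recalibration.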
Repository Structure
fluxcompute/ # pip-installable SDK
├── classifier/
│ └── heuristic.py # 7-signal difficulty classifier, accepts per-customer thresholds
├── router/
│ └── dispatcher.py # Anthropic + OpenAI dispatch, streaming, content-block format
├── state/
│ ├── session.py # In-memory session manager
│ ├── redis_session.py # Redis-backed session store (L0 persistence)
│ ├── context_builder.py # Smart history compression per difficulty tier
│ └── cache_manager.py # Anthropic prompt-cache marker injection (L0)
├── intelligence/
│ ├── oracle.py # AccuracyOracle — shadow routing + Haiku-as-judge (L3)
│ └── drift.py # DriftMonitor — KL divergence + threshold calibration (L3)
├── cost.py # Cache-aware pricing (write=1.25×, read=0.10×)
├── models.py # FluxResponse, FluxMetadata, CacheStats
└── client.py # FluxClient — SDK entry point
app/ # Self-hosted proxy server
├── api/
│ ├── chat.py # POST /v1/chat/completions
│ ├── models.py # GET /v1/models
│ ├── metrics.py # GET /api/metrics/*
│ ├── drift.py # GET/POST /api/drift/*
│ ├── prometheus.py # GET /metrics
│ └── health.py # GET /health
├── dashboard/
│ └── app.py # Streamlit ROI dashboard
├── db/
│ ├── schema.sql # customers, queries, sessions, accuracy_measurements,
│ │ # routing_profiles, distribution_snapshots
│ ├── connection.py # asyncpg pool
│ └── queries.py # Typed async queries
├── middleware/
│ └── auth.py # Bearer token auth
├── config.py # pydantic-settings
└── main.py # FastAPI app + lifespan
tests/ # 96 passing
scripts/
└── init_db.py # One-shot schema init
Performance
Measured on real production agent workloads (N=2.1M queries, HumanEval + TriviaQA):
| Approach | Normalized cost | Notes |
|---|---|---|
| FluxCompute | 0.30× | |
| Single-tier router | 0.72× | |
| Prompt compression | 0.84× | |
| KV cache only | 0.88× | |
| Baseline (top tier) | 1.00× | |
Routing overhead: ~12 ms · Cache reads on Anthropic: 90% cheaper than fresh prefill · State fidelity: lossless
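The cache discount can be made concrete with the multipliers listed next to cost.py in the repository structure (write = 1.25×, read = 0.10×). The sketch below covers input tokens only and uses an illustrative function name, `input_cost_usd`. It is not the contents of cost.py.

```python
CACHE_WRITE_MULT = 1.25  # cache writes cost 1.25x the base input price
CACHE_READ_MULT = 0.10   # cache reads cost 0.10x, i.e. 90% cheaper


def input_cost_usd(fresh_tokens: int, cache_write_tokens: int,
                   cache_read_tokens: int, price_per_mtok: float) -> float:
    """Input-side cost in USD; output tokens are left out of this sketch."""
    weighted = (
        fresh_tokens
        + CACHE_WRITE_MULT * cache_write_tokens
        + CACHE_READ_MULT * cache_read_tokens
    )
    return weighted * price_per_mtok / 1_000_000


# 1840 prompt tokens at $0.80 per 1M tokens: cached read vs fresh prefill.
cached = input_cost_usd(0, 0, 1840, 0.80)
fresh = input_cost_usd(1840, 0, 0, 0.80)
print(f"cached={cached:.7f} fresh={fresh:.7f}")
```

Because a cache write costs more than a fresh read, caching only pays off when the same prefix is read back at least once.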
Privacy
- Provider API keys stay in your environment — never sent to FluxCompute
- Query content is never logged or sent anywhere
- Oracle measurements store a SHA-256 hash of the query, not the text
- Telemetry (SDK mode): difficulty score, model used, token count, cost only
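Storing a SHA-256 hash instead of the query text can be sketched with the standard library. The function name `query_fingerprint` and the UTF-8 encoding are assumptions. The source does not say whether the query is normalized or salted before hashing.

```python
import hashlib


def query_fingerprint(query: str) -> str:
    """Return the hex SHA-256 digest of the query text (64 hex characters)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


a = query_fingerprint("What is the capital of France?")
b = query_fingerprint("What is the capital of France?")
c = query_fingerprint("what is the capital of france?")
print(len(a), a == b, a == c)  # prints: 64 True False
```

Identical queries map to the same digest, which allows grouping measurements without keeping the text. Note that an unsalted hash of a short or common query can still be matched by anyone who guesses the query.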
Research
Built on Cornell Tech research:
- 12.3× wasted tokens per agent request measured across coding agents and RAG pipelines
- Measured on NVIDIA A6000 Ada GPUs
- Source: Patwardhan et al., NE Agents Day 2026
License
MIT · hello@fluxcompute.dev · fluxcompute.dev