The compiler for agentic systems. Route every query to the optimal model.

FluxCompute

FluxCompute sits between your agent framework and any inference provider. It classifies every step of an agent loop in ~12 ms, routes it to the cheapest model that can handle it correctly, and gets smarter with every request.

60–70% inference cost reduction · <1% accuracy delta · zero code changes

How it works

A single agent request can expand into a chain of 50+ model calls. Most teams send every step to a top-tier model, including trivial ones such as formatting a JSON tool call, which a 1B-parameter model can handle for a fraction of a cent.

FluxCompute intercepts each step and routes it:

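| Tier   | Model                      | Price (per 1M tokens) | When                                 |
|--------|----------------------------|-----------------------|--------------------------------------|
| Easy   | Claude Haiku / GPT-4o-mini | $0.80                 | lookups, formatting, simple Q&A      |
| Medium | Claude Sonnet / GPT-4o     | $3                    | analysis, summarization, light code  |
| Hard   | Claude Opus / o1           | $15                   | multi-hop reasoning, complex code    |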
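
To make the tier table concrete, here is a minimal sketch of mapping a difficulty score to a tier and estimating a blended price. The models and prices come from the table; the score cut-offs (0.33 and 0.66) and the names `Tier`, `pick_tier`, and `blended_price` are hypothetical, not FluxCompute's actual thresholds or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    label: str
    model: str
    price_per_m: float  # USD per million tokens, taken from the table above


# Upper score bound -> tier. The cut-offs are hypothetical placeholders;
# the README says real thresholds are calibrated per customer.
TIERS = [
    (0.33, Tier("easy", "Claude Haiku", 0.80)),
    (0.66, Tier("medium", "Claude Sonnet", 3.00)),
    (1.00, Tier("hard", "Claude Opus", 15.00)),
]


def pick_tier(difficulty_score: float) -> Tier:
    """Return the cheapest tier whose upper bound covers the score."""
    for upper_bound, tier in TIERS:
        if difficulty_score <= upper_bound:
            return tier
    return TIERS[-1][1]  # out-of-range scores fall back to the hard tier


def blended_price(mix: dict[str, float]) -> float:
    """Average price per million tokens for a traffic mix (shares sum to 1)."""
    prices = {tier.label: tier.price_per_m for _, tier in TIERS}
    return sum(share * prices[label] for label, share in mix.items())


if __name__ == "__main__":
    print(pick_tier(0.12).model)
    print(blended_price({"easy": 0.6, "medium": 0.3, "hard": 0.1}))
```

Under these assumptions, a 60/30/10 split of easy/medium/hard traffic averages about $2.88 per million tokens against $15 for an all-Opus baseline, provided each tier sees similar token counts.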

Architecture: Five Layers

YOUR AGENT
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  L0  KV Cache Persistence                               │
│      Redis-backed session store · prompt-cache markers  │
│      Anthropic cache reads: 90% cheaper than fresh      │
├─────────────────────────────────────────────────────────┤
│  L1  Query Classifier                                   │
│      7-signal heuristic · ~12 ms · no network call      │
│      Per-customer thresholds calibrated by L3           │
├─────────────────────────────────────────────────────────┤
│  L2  Model Executor + Context Handoff                   │
│      Retry escalation: Haiku → Sonnet → Opus            │
│      ContextBuilder: smart compression by difficulty    │
│      CacheManager: cache_control markers for Anthropic  │
├─────────────────────────────────────────────────────────┤
│  L3  Drift Monitor                                      │
│      AccuracyOracle: 5% shadow sample, Haiku-as-judge   │
│      KL divergence on difficulty distribution           │
│      Auto-recompile: threshold calibration from data    │
├─────────────────────────────────────────────────────────┤
│  L4  Observability                                      │
│      Streamlit dashboard · Prometheus /metrics          │
│      PostgreSQL query log · per-customer accuracy       │
└─────────────────────────────────────────────────────────┘
    │
    ▼
ANY PROVIDER  (Anthropic · OpenAI · local weights)
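
The retry escalation in the L2 box (Haiku → Sonnet → Opus) can be sketched as a loop over progressively stronger models. This is an illustration only: `run_with_escalation` and the `call_model` callable are hypothetical names, and how FluxCompute decides that an answer is unusable is not described here.

```python
from typing import Callable, Optional

# Escalation order taken from the L2 layer in the diagram above.
ESCALATION_ORDER = ["haiku", "sonnet", "opus"]


def run_with_escalation(
    prompt: str,
    start: str,
    call_model: Callable[[str, str], Optional[str]],
) -> tuple[str, str]:
    """Try `start`, then each stronger model, until one returns an answer.

    `call_model(model, prompt)` is a hypothetical callable that returns the
    answer text, or None when the output is rejected (for example, a tool
    call that is not valid JSON).
    """
    first = ESCALATION_ORDER.index(start)
    for model in ESCALATION_ORDER[first:]:
        answer = call_model(model, prompt)
        if answer is not None:
            return model, answer
    raise RuntimeError("all tiers failed for this step")


if __name__ == "__main__":
    def fake_call(model: str, prompt: str) -> Optional[str]:
        # Pretend the cheapest model fails so the loop escalates once.
        return None if model == "haiku" else f"{model}: ok"

    print(run_with_escalation("format this tool call", "haiku", fake_call))
```

Starting from the classifier's tier rather than always from the bottom keeps hard queries from paying for two failed attempts first.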

Integration: Two Modes

Mode 1 — Proxy (zero code changes)

Point your existing OpenAI SDK at FluxCompute. Nothing else changes.

import openai

client = openai.OpenAI(
    api_key="flx_your_key",
    base_url="https://api.fluxcompute.dev/v1",
)

response = client.chat.completions.create(
    model="auto",   # FluxCompute decides
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Standard OpenAI response + FluxCompute metadata
print(response.choices[0].message.content)    # "Paris"
print(response.fluxcompute["model_selected"]) # "claude-3-5-haiku-20241022"
print(response.fluxcompute["savings_usd"])    # 0.0035

Streaming works the same way — just pass stream=True.

Mode 2 — SDK (direct, for maximum control)

import asyncio
from fluxcompute import FluxClient

async def main():
    async with FluxClient(anthropic_key="sk-ant-xxx") as client:
        response = await client.messages.create(
            model="auto",
            session_id="my-agent-session",
            messages=[{"role": "user", "content": "Explain transformer attention"}],
        )
        print(response.text)
        print(response.fluxcompute.difficulty_label)   # "medium"
        print(response.fluxcompute.savings_usd)        # 0.0041
        print(response.fluxcompute.cache.cache_hit)    # True (on repeat turns)

asyncio.run(main())

Install

SDK only:

pip install fluxcompute

Self-hosted proxy server:

pip install "fluxcompute[server]"

Self-hosting

1. Environment

cp .env.example .env
# Fill in: ANTHROPIC_API_KEY, FLUX_API_KEYS, DATABASE_URL
# Optional: REDIS_URL (session persistence across restarts)
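
Before starting the server, it can help to confirm that the required variables are actually set. The following is a small preflight sketch using only the standard library; the variable names come from the comment above, while `missing_settings` is a hypothetical helper and not something the package ships.

```python
import os
from typing import Mapping, Optional

# Names taken from the .env instructions above.
REQUIRED = ("ANTHROPIC_API_KEY", "FLUX_API_KEYS", "DATABASE_URL")
OPTIONAL = ("REDIS_URL",)


def missing_settings(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing required settings: " + ", ".join(missing))
    else:
        print("All required settings are present.")
    for name in OPTIONAL:
        if not os.environ.get(name):
            print(f"Optional setting not set: {name}")
```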

2. Database

python scripts/init_db.py

3. Run

uvicorn app.main:app --host 0.0.0.0 --port 8000

4. Dashboard

streamlit run app/dashboard/app.py

Deploy to Railway

railway up

Railway auto-provisions PostgreSQL and Redis if you add those add-ons. Set env vars in the Railway dashboard.


API Reference

Inference

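| Method | Path                 | Description                        |
|--------|----------------------|------------------------------------|
| POST   | /v1/chat/completions | OpenAI-compatible routing endpoint |
| GET    | /v1/models           | List available models              |
| GET    | /v1/models/{id}      | Get a single model                 |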

Request — identical to OpenAI format. Set model: "auto" for automatic routing.

Response — standard OpenAI fields plus:

{
  "fluxcompute": {
    "difficulty_score": 0.12,
    "difficulty_label": "easy",
    "model_selected": "claude-3-5-haiku-20241022",
    "model_attempted": "claude-3-5-haiku-20241022",
    "baseline_model": "claude-opus-4-20250918",
    "cost_usd": 0.00000064,
    "baseline_cost_usd": 0.0000120,
    "savings_usd": 0.0000114,
    "savings_pct": 94.7,
    "classification_ms": 8.3,
    "overhead_ms": 11.2,
    "session_id": "fc_a1b2c3d4e5f6",
    "context_compression": 0.72,
    "cache": {
      "cache_write_tokens": 0,
      "cache_read_tokens": 1840,
      "cache_hit": true
    }
  }
}
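
The cost fields in this payload are related by simple arithmetic, which a client can sanity-check. The sketch below parses a subset of the example and verifies that `savings_usd` and `savings_pct` follow from the two cost fields. Reading `model_attempted` as "the first model tried" is an assumption, and `savings_consistent` and `was_escalated` are hypothetical helper names.

```python
import json
import math

# Subset of the example payload shown above.
payload = json.loads("""
{
  "fluxcompute": {
    "model_selected": "claude-3-5-haiku-20241022",
    "model_attempted": "claude-3-5-haiku-20241022",
    "cost_usd": 0.00000064,
    "baseline_cost_usd": 0.0000120,
    "savings_usd": 0.0000114,
    "savings_pct": 94.7
  }
}
""")


def savings_consistent(meta: dict, rel_tol: float = 0.01) -> bool:
    """Check savings_usd and savings_pct against the two cost fields."""
    expected_usd = meta["baseline_cost_usd"] - meta["cost_usd"]
    expected_pct = 100.0 * expected_usd / meta["baseline_cost_usd"]
    return (
        math.isclose(meta["savings_usd"], expected_usd, rel_tol=rel_tol)
        and math.isclose(meta["savings_pct"], expected_pct, rel_tol=rel_tol)
    )


def was_escalated(meta: dict) -> bool:
    """Assumption: a differing attempted/selected model means a retry occurred."""
    return meta["model_attempted"] != meta["model_selected"]


if __name__ == "__main__":
    meta = payload["fluxcompute"]
    print(savings_consistent(meta), was_escalated(meta))
```

The tolerance allows for the rounding visible in the example (0.00001136 reported as 0.0000114).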

Headers:

  • Authorization: Bearer flx_your_key
  • X-FluxCompute-Session: session_id — enables multi-turn state tracking
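
For clients that do not use the OpenAI SDK, the two headers can be set on a plain HTTP request. This sketch only builds the request with the standard library and does not send it; `build_chat_request` is a hypothetical helper, and the key and session values are placeholders.

```python
import json
import urllib.request


def build_chat_request(
    api_key: str,
    session_id: str,
    content: str,
    base_url: str = "https://api.fluxcompute.dev/v1",
) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request with both headers."""
    body = json.dumps(
        {"model": "auto", "messages": [{"role": "user", "content": content}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-FluxCompute-Session": session_id,  # ties turns to one session
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_chat_request("flx_your_key", "fc_a1b2c3d4e5f6", "Hello")
    print(req.get_method(), req.full_url)
    for name, value in req.header_items():
        print(f"{name}: {value}")
```

With the OpenAI Python SDK, the same header can be passed per call through the `extra_headers` argument of `client.chat.completions.create`.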

Metrics

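| Method | Path                               | Description                            |
|--------|------------------------------------|----------------------------------------|
| GET    | /api/metrics/summary?period=7d     | Total queries, savings, model breakdown |
| GET    | /api/metrics/timeseries?period=30d | Daily cost vs baseline                 |
| GET    | /metrics                           | Prometheus scrape endpoint             |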

L3 Drift Monitor

| Method | Path                 | Description                                   |
|--------|----------------------|-----------------------------------------------|
| GET    | /api/drift/status    | Accuracy per tier, KL divergence, drift flags |
| POST   | /api/drift/recompile | Recalibrate thresholds from measured accuracy |
| GET    | /api/drift/accuracy  | Oracle measurement history                    |
| GET    | /api/drift/profile   | Active routing thresholds for this customer   |

Health

| Method | Path    | Description                    |
|--------|---------|--------------------------------|
| GET    | /health | Service + DB connectivity      |
| GET    | /docs   | Interactive API docs (Swagger) |

L3: The Drift Monitor

This is the moat.

Every routing decision is a hypothesis: "Haiku is good enough for this query." Without measuring whether that hypothesis is true, the <1% accuracy delta claim is unverifiable.

The oracle fixes this:

  1. For 5% of non-hard requests, the same query is also sent to Opus in the background, invisibly to the caller
  2. Haiku judges whether the cheaper response is equivalent to the Opus response and returns equivalent: true/false with confidence: 0.0–1.0
  3. Results accumulate in accuracy_measurements
  4. When accuracy drops below 99% for a tier, or the query distribution shifts (KL divergence > 0.10), POST /api/drift/recompile recalibrates thresholds
  5. New thresholds take effect on the next request — no restart
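
The trigger in step 4 can be sketched as follows. The 99% accuracy floor and the 0.10 KL threshold come from the list above; the direction of the divergence (current against reference), the natural-log base, the smoothing constant, and the function names are all assumptions for illustration.

```python
import math

ACCURACY_FLOOR = 0.99  # from step 4
KL_THRESHOLD = 0.10    # from step 4


def kl_divergence(current: dict, reference: dict, eps: float = 1e-9) -> float:
    """KL(current || reference) over difficulty labels, in nats.

    Zero reference probabilities are floored at `eps` to avoid division by zero.
    """
    total = 0.0
    for label in set(current) | set(reference):
        p = current.get(label, 0.0)
        q = max(reference.get(label, 0.0), eps)
        if p > 0.0:
            total += p * math.log(p / q)
    return total


def needs_recompile(accuracy_by_tier: dict, current: dict, reference: dict) -> bool:
    """True when any tier is below the floor or the distribution has drifted."""
    accuracy_low = any(acc < ACCURACY_FLOOR for acc in accuracy_by_tier.values())
    drifted = kl_divergence(current, reference) > KL_THRESHOLD
    return accuracy_low or drifted


if __name__ == "__main__":
    reference = {"easy": 0.6, "medium": 0.3, "hard": 0.1}
    shifted = {"easy": 0.2, "medium": 0.3, "hard": 0.5}
    healthy = {"easy": 0.995, "medium": 0.992}
    print(needs_recompile(healthy, reference, reference))
    print(needs_recompile(healthy, shifted, reference))
```

In this example the shifted distribution has a divergence of roughly 0.58 nats, well past the 0.10 threshold, so recalibration would be triggered even though measured accuracy is still healthy.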

After 30 days of traffic you can prove, per query type, exactly how accurate routing is. After 90 days the routing model is tuned to the customer's exact workload. No competitor starting fresh can replicate this.

# Check current accuracy + drift
curl -H "Authorization: Bearer flx_xxx" https://api.fluxcompute.dev/api/drift/status

# Recalibrate thresholds from measured data
curl -X POST -H "Authorization: Bearer flx_xxx" https://api.fluxcompute.dev/api/drift/recompile

Repository Structure

fluxcompute/              # pip-installable SDK
├── classifier/
│   └── heuristic.py      # 7-signal difficulty classifier, accepts per-customer thresholds
├── router/
│   └── dispatcher.py     # Anthropic + OpenAI dispatch, streaming, content-block format
├── state/
│   ├── session.py        # In-memory session manager
│   ├── redis_session.py  # Redis-backed session store (L0 persistence)
│   ├── context_builder.py # Smart history compression per difficulty tier
│   └── cache_manager.py  # Anthropic prompt-cache marker injection (L0)
├── intelligence/
│   ├── oracle.py         # AccuracyOracle — shadow routing + Haiku-as-judge (L3)
│   └── drift.py          # DriftMonitor — KL divergence + threshold calibration (L3)
├── cost.py               # Cache-aware pricing (write=1.25×, read=0.10×)
├── models.py             # FluxResponse, FluxMetadata, CacheStats
└── client.py             # FluxClient — SDK entry point

app/                      # Self-hosted proxy server
├── api/
│   ├── chat.py           # POST /v1/chat/completions
│   ├── models.py         # GET /v1/models
│   ├── metrics.py        # GET /api/metrics/*
│   ├── drift.py          # GET/POST /api/drift/*
│   ├── prometheus.py     # GET /metrics
│   └── health.py         # GET /health
├── dashboard/
│   └── app.py            # Streamlit ROI dashboard
├── db/
│   ├── schema.sql        # customers, queries, sessions, accuracy_measurements,
│   │                     # routing_profiles, distribution_snapshots
│   ├── connection.py     # asyncpg pool
│   └── queries.py        # Typed async queries
├── middleware/
│   └── auth.py           # Bearer token auth
├── config.py             # pydantic-settings
└── main.py               # FastAPI app + lifespan

tests/                    # 96 passing
scripts/
└── init_db.py            # One-shot schema init

Performance

Measured on real production agent workloads (N=2.1M queries, HumanEval + TriviaQA):

| Approach            | Normalized cost |
|---------------------|-----------------|
| FluxCompute         | 0.30×           |
| Single-tier router  | 0.72×           |
| Prompt compression  | 0.84×           |
| KV cache only       | 0.88×           |
| Baseline (top tier) | 1.00×           |

Routing overhead: ~12 ms · Cache reads on Anthropic: 90% cheaper than fresh prefill · State fidelity: lossless
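
The 90% figure matches the read=0.10× multiplier listed next to cost.py in the repository tree. A minimal sketch of cache-aware input pricing under those multipliers is shown below; `input_cost_usd` is a hypothetical function, not the actual contents of cost.py, and the token counts are made up for the example.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # from the cost.py comment in the repository tree
CACHE_READ_MULTIPLIER = 0.10   # from the cost.py comment in the repository tree


def input_cost_usd(
    fresh_tokens: int,
    cache_write_tokens: int,
    cache_read_tokens: int,
    price_per_m: float,
) -> float:
    """Input cost in USD, weighting cached tokens by the multipliers above."""
    weighted_tokens = (
        fresh_tokens
        + CACHE_WRITE_MULTIPLIER * cache_write_tokens
        + CACHE_READ_MULTIPLIER * cache_read_tokens
    )
    return weighted_tokens * price_per_m / 1_000_000


if __name__ == "__main__":
    price = 0.80  # USD per million tokens, the Easy-tier price
    uncached = input_cost_usd(10_000, 0, 0, price)
    cached = input_cost_usd(0, 0, 10_000, price)
    print(f"uncached: ${uncached:.6f}  cached read: ${cached:.6f}")
```

For a 10,000-token prefix at $0.80 per million tokens, a cache read costs about $0.0008 instead of $0.008, while the first request that writes the cache pays 1.25× the normal rate.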


Privacy

  • Provider API keys stay in your environment — never sent to FluxCompute
  • Query content is never logged; it is forwarded only to the inference provider that serves the request
  • Oracle measurements store a SHA-256 hash of the query, not the text
  • Telemetry (SDK mode): difficulty score, model used, token count, cost only
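
Storing a hash instead of the text, as described for oracle measurements, can be sketched in a few lines. Whether FluxCompute normalizes or salts the query before hashing is not stated, so this shows only the plain SHA-256 case; `query_fingerprint` is a hypothetical name.

```python
import hashlib


def query_fingerprint(query: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = query_fingerprint("What is the capital of France?")
    print(len(digest), digest)
```

The digest is always 64 hex characters and identical queries map to the same value, which lets measurements be grouped without retaining the original text.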

Research

Built on Cornell Tech research:

  • 12.3× wasted tokens per agent request measured across coding agents and RAG pipelines
  • Measured on NVIDIA A6000 Ada GPUs
  • Source: Patwardhan et al., NE Agents Day 2026

License

MIT · hello@fluxcompute.dev · fluxcompute.dev
