Budget-Aware Agentic Routing — route LLM calls intelligently between cheap and powerful models with a hard budget cap.
Baar-Core
Stop LLM API calls before they happen. Not after.
pip install baar-core
I left an agent loop running overnight. Woke up to a $47 bill — 20,000 GPT-4o tokens answering "what time is it?" queries.
Baar-Core would have stopped it at $0.10. The first call that would exceed the cap is blocked locally: no network request is made, and nothing is spent beyond the cap.
from baar import BAARRouter
router = BAARRouter(budget=0.10) # hard cap: $0.10 total
router.chat("What time is it?") # → cheap model, ~$0.0001
router.chat("Write a CUDA matmul kernel") # → capable model if budget allows
# budget exhausted → raises BudgetExhausted, zero API calls made
The problem with every other solution
Most cost tools track spend after the fact. You get an alert when the bill is already large.
LiteLLM's budget manager, Portkey rate limits, provider spend alerts — they all tell you what happened. They don't stop it mid-flight.
Baar-Core is a local kill-switch. Before each call, it estimates the cost. If the remaining budget is too low, it raises an exception locally — no DNS lookup, no TCP connection, no token consumed. The call never leaves your machine.
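To make the idea concrete, here is a minimal sketch of a local pre-flight check. It is not Baar-Core's implementation: the names `LocalBudgetExceeded`, `estimate_cost`, and `preflight`, the 4-characters-per-token heuristic, and the per-1k-token prices are all assumptions chosen for illustration.

```python
class LocalBudgetExceeded(Exception):
    """Hypothetical stand-in for the library's BudgetExhausted exception."""

    def __init__(self, remaining: float, estimated: float):
        super().__init__(
            f"estimated ${estimated:.6f} exceeds remaining ${remaining:.6f}"
        )
        self.remaining = remaining
        self.estimated = estimated


def estimate_cost(prompt: str, max_output_tokens: int,
                  usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Rough local estimate: ~4 characters per token, worst-case output length."""
    input_tokens = max(1, len(prompt) // 4)
    return (input_tokens * usd_per_1k_input
            + max_output_tokens * usd_per_1k_output) / 1000.0


def preflight(prompt: str, remaining: float, *, max_output_tokens: int = 256,
              usd_per_1k_input: float = 0.00015,    # placeholder price, an assumption
              usd_per_1k_output: float = 0.0006) -> float:  # placeholder price
    """Return the estimated cost, or raise before any network code is reached."""
    estimated = estimate_cost(prompt, max_output_tokens,
                              usd_per_1k_input, usd_per_1k_output)
    if estimated > remaining:
        raise LocalBudgetExceeded(remaining, estimated)
    return estimated
```

Because the check is pure arithmetic on local values, a rejected call costs nothing: the exception fires before any HTTP client is ever invoked.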
How it works
User task
    │
    ▼
┌─────────────────────────────────┐
│ Pre-flight budget check         │ ← if estimated cost > remaining budget
│ (local, zero network)           │   raise BudgetExhausted immediately
└────────────┬────────────────────┘
             │ affordable
             ▼
┌─────────────────────────────────┐
│ Semantic complexity router      │ ← cheap LLM scores complexity 0.0–1.0
│ (gpt-4o-mini, ~$0.000015/call)  │
└────────────┬────────────────────┘
             │
      ┌──────┴───────┐
      │              │
   simple         complex
      │              │
      ▼              ▼
 Cheap model   Budget check
  (fast, $)      ├─ affordable → Capable model ($$$)
                 └─ too close → Downgrade to cheap model ($)
- Pre-flight check — Estimates cost locally before any network call. Kills the request if it would overshoot.
- Semantic routing — A fast, cheap model scores task complexity. Not keyword matching — actual semantic understanding.
- Budget-aware downgrade — Running low? Hard tasks automatically fall back to the cheaper model so the turn still completes.
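The routing and downgrade steps above can be sketched as a single decision function. This is an illustrative reading of the diagram, not the library's code: `choose_model` and its parameters are hypothetical names, and treating scores at or above the threshold as "complex" is an assumption.

```python
def choose_model(complexity: float, remaining: float, est_big_cost: float,
                 threshold: float = 0.80,
                 small: str = "gpt-4o-mini", big: str = "gpt-4o") -> str:
    """Pick a model tier from a complexity score and the remaining budget."""
    if complexity < threshold:
        # Simple task: the cheap model handles it.
        return small
    if est_big_cost <= remaining:
        # Complex task and the capable model is still affordable.
        return big
    # Complex task but too close to the cap: downgrade so the turn completes.
    return small


print(choose_model(0.2, remaining=0.10, est_big_cost=0.01))   # simple task
print(choose_model(0.9, remaining=0.10, est_big_cost=0.01))   # complex, affordable
print(choose_model(0.9, remaining=0.005, est_big_cost=0.01))  # complex, downgraded
```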
Quick start
from baar import BAARRouter, BudgetExhausted
# Basic usage
router = BAARRouter(budget=0.10)
reply = router.chat("Explain recursion with a Python example")
print(reply)
print(f"Spent: ${router.spent:.5f} / Remaining: ${router.remaining:.5f}")
# Multi-step with a report
log = router.run([
    "What is 42 * 17?",
    "Translate 'good morning' to Japanese",
    "Design a distributed rate-limiter for 100k RPS — include trade-offs",
    "Convert 72°F to Celsius",
])
log.print_report()
# Async
import asyncio
async def main():
    router = BAARRouter(budget=0.05)
    reply = await router.achat("Summarize the CAP theorem")
    print(reply)
asyncio.run(main())
# Kill-switch in action
router = BAARRouter(budget=0.00001)
try:
    router.chat("Any prompt at all")
except BudgetExhausted as e:
    print(f"Blocked before API call. Remaining: ${e.remaining:.6f}")
    # Zero network calls made. $0 spent.
Works with any LiteLLM-supported provider: OpenAI, Anthropic, Groq, Together, Ollama, OpenRouter, Azure, and more.
Real-world examples
| Example | Use case |
|---|---|
| fastapi_per_user_budget.py | SaaS: per-user $0.10 quota with SQLite persistence |
| agent_loop.py | Autonomous agent loop with graceful budget stop |
| streaming.py | Streaming responses with live budget tracking |
| multi_tenant.py | Concurrent multi-user budget isolation, quota report |
| basic_usage.py | Getting started |
Persistent budgets (survive process restarts)
By default, budgets are in-memory. For production, plug in a persistent store:
from baar import BAARRouter
from baar.core.stores import SQLiteBudgetStore, FileBudgetStore
# Per-user quota in a SQLite database — thread-safe, no extra deps
router = BAARRouter(
    budget=0.10,
    store=SQLiteBudgetStore("budgets.db", namespace="user_alice"),
)
# Restarts don't reset the budget — spend is loaded from disk
router.chat("Hello") # deducted from Alice's persistent $0.10
# JSON file — good for single-process scripts
router = BAARRouter(
    budget=1.00,
    store=FileBudgetStore("my_budget.json"),
)
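For intuition about what a persistent store has to do, here is a minimal sketch built on the standard-library `sqlite3` module. It is not `SQLiteBudgetStore`: the class name `TinySQLiteBudget`, the table layout, and the method names are hypothetical, and the sketch skips the thread-safety the real store advertises.

```python
import sqlite3


class TinySQLiteBudget:
    """Hypothetical per-namespace spend ledger that survives restarts."""

    def __init__(self, path: str, namespace: str, budget: float):
        self.namespace = namespace
        self.budget = budget
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS spend ("
            "namespace TEXT PRIMARY KEY, spent REAL NOT NULL)"
        )
        # Create the row on first use; keep existing spend on later runs.
        self.conn.execute(
            "INSERT OR IGNORE INTO spend (namespace, spent) VALUES (?, 0.0)",
            (namespace,),
        )
        self.conn.commit()

    @property
    def spent(self) -> float:
        row = self.conn.execute(
            "SELECT spent FROM spend WHERE namespace = ?", (self.namespace,)
        ).fetchone()
        return row[0]

    @property
    def remaining(self) -> float:
        return self.budget - self.spent

    def charge(self, cost: float) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self.conn:
            self.conn.execute(
                "UPDATE spend SET spent = spent + ? WHERE namespace = ?",
                (cost, self.namespace),
            )

    def close(self) -> None:
        self.conn.close()
```

Reopening the same database file with the same namespace picks up the previously recorded spend, which is the behavior the persistent stores above are described as providing.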
Benchmarks
Routing cost benchmark — mock mode
Mock mode runs the full routing pipeline with simulated completions to measure routing overhead and cost allocation without spending real money. Use this to tune thresholds before a live run.
Note: Accuracy figures in mock mode reflect simulated task responses, not real model capability. Use live mode for accuracy measurement. The cost figures and routing split percentages are the meaningful outputs here.
baar-bench --dataset all --limit 200 --budget 10 --mock \
--complexity-threshold 0.80 --coding-threshold 0.75 --seed 42
| Dataset | Strategy | % routed to cheap | Total cost | Savings vs always-big |
|---|---|---|---|---|
| MMLU | Always big | 0% | $1.0005 | — |
| MMLU | Baar-Core | 81% | $0.157 | 84.3% cheaper |
| GSM8K | Always big | 0% | $1.0005 | — |
| GSM8K | Baar-Core | 87% | $0.129 | 87.1% cheaper |
| HumanEval | Always big | 0% | $1.0005 | — |
| HumanEval | Baar-Core | 39% | $0.614 | 38.6% cheaper |
HumanEval routes fewer tasks to the cheap tier because coding questions score high complexity — the router correctly identifies them as hard.
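The savings column is simply one minus the ratio of routed cost to always-big cost. A short check against the mock-mode figures in the table above (the helper name `savings_pct` is just for illustration):

```python
def savings_pct(baseline_cost: float, routed_cost: float) -> float:
    """Percentage saved relative to sending every task to the big model."""
    return round((1 - routed_cost / baseline_cost) * 100, 1)


# (always-big cost, Baar-Core cost) taken from the mock-mode table
mock_results = {
    "MMLU": (1.0005, 0.157),
    "GSM8K": (1.0005, 0.129),
    "HumanEval": (1.0005, 0.614),
}

for name, (big, routed) in mock_results.items():
    print(f"{name}: {savings_pct(big, routed)}% cheaper")
```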
Live benchmark — real API calls (10 tasks per dataset)
baar-bench --dataset all --limit 10 --budget 2 \
--complexity-threshold 0.80 --coding-threshold 0.75 --seed 42
| Dataset | Strategy | Total cost | Savings vs always-big |
|---|---|---|---|
| MMLU | Always big | $0.002337 | — |
| MMLU | Baar-Core | $0.000137 | 94.1% cheaper |
| GSM8K | Always big | $0.027615 | — |
| GSM8K | Baar-Core | $0.002097 | 92.4% cheaper |
| HumanEval | Always big | $0.032125 | — |
| HumanEval | Baar-Core | $0.002743 | 91.5% cheaper |
Run it yourself: pip install baar-core datasets, then baar-bench --limit 10 --mock (free), or omit --mock and add your API key for live results.
vs. alternatives
| Feature | Baar-Core | RouteLLM | LiteLLM | Portkey |
|---|---|---|---|---|
| Hard local kill-switch (zero network calls) | ✅ | ❌ | ❌ | ❌ |
| Works fully offline | ✅ | ❌ | ❌ | ❌ |
| Per-user persistent budgets | ✅ SQLite/File | ❌ | Partial | ✅ (managed) |
| Semantic complexity routing | ✅ | ✅ | ✅ | ✅ |
| No proxy / no server required | ✅ | ✅ | ❌ | ❌ |
| Concurrent TOCTOU-safe reservations | ✅ | ❌ | ❌ | N/A |
| Open source (MIT) | ✅ | ✅ | ✅ | ❌ |
The key difference: every alternative routes and tracks. Baar-Core prevents — the exception is raised before a single byte leaves your machine.
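The "TOCTOU-safe reservations" row presumably refers to the race where two concurrent calls both see enough budget, and both spend it. A common way to close that gap is to reserve the estimate under a lock before the call and settle afterwards. The sketch below shows that pattern with hypothetical names (`ReservingBudget`, `reserve`, `commit`); it is an assumption about the general technique, not a description of Baar-Core's internals.

```python
import threading


class ReservingBudget:
    """Hypothetical budget that reserves estimated cost before each call."""

    def __init__(self, budget: float):
        self._lock = threading.Lock()
        self._budget = budget
        self._spent = 0.0
        self._reserved = 0.0

    def reserve(self, estimate: float) -> bool:
        """Check and reserve in one locked step, so no other thread can interleave."""
        with self._lock:
            if self._spent + self._reserved + estimate > self._budget:
                return False
            self._reserved += estimate
            return True

    def commit(self, estimate: float, actual: float) -> None:
        """Release the reservation and record what the call really cost."""
        with self._lock:
            self._reserved -= estimate
            self._spent += actual

    @property
    def remaining(self) -> float:
        with self._lock:
            return self._budget - self._spent - self._reserved
```

A caller would `reserve()` the estimate, make the request only if that returned `True`, then `commit()` with the actual cost reported by the provider.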
Configuration
router = BAARRouter(
    budget=0.10,                    # hard cap in USD
    small_model="gpt-4o-mini",      # cheap tier (any LiteLLM model)
    big_model="gpt-4o",             # capable tier
    complexity_threshold=0.80,      # 0.0–1.0: higher = more traffic to cheap model
    min_cost_threshold=0.0001,      # kill-switch floor: reject if any call costs more
    routing_task_char_limit=500,    # chars sent to routing LLM (head+mid+tail sample)
    use_llm_router=True,            # False = rule-based heuristic only (no routing cost)
    small_fallback_models=["gpt-4o-mini-2024-07-18"],  # failover chain
    big_fallback_models=["gpt-4o-2024-08-06"],
    telemetry_jsonl_path="telemetry.jsonl",  # optional audit log
)
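The `routing_task_char_limit` comment mentions a head+mid+tail sample. One plausible reading is sketched below: long tasks are cut down to three equal slices so the routing model sees the start, middle, and end rather than only the opening characters. The function name `sample_for_routing` and the equal three-way split are assumptions; the library may sample differently.

```python
def sample_for_routing(task: str, limit: int = 500) -> str:
    """Return at most `limit` characters drawn from the head, middle, and tail."""
    if len(task) <= limit:
        return task
    if limit < 3:
        # Too small to split three ways; fall back to a plain prefix.
        return task[:limit]
    part = limit // 3
    head = task[:part]
    mid_start = (len(task) - part) // 2
    mid = task[mid_start:mid_start + part]
    tail = task[-part:]
    return head + mid + tail
```

Sampling beyond the head means a task padded with filler at the start still exposes some of its real content to the router.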
Budget pressure — as spend approaches the cap, the effective complexity threshold rises automatically. The big model becomes harder to justify as you run low, so more traffic shifts to cheap naturally.
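As a rough illustration of budget pressure, the sketch below raises the threshold linearly from its configured value toward 1.0 as spend approaches the cap. The linear ramp and the name `effective_threshold` are assumptions made for this example; the curve Baar-Core actually uses is not specified here.

```python
def effective_threshold(base: float, spent: float, budget: float) -> float:
    """Interpolate from `base` (nothing spent) to 1.0 (budget fully spent)."""
    if budget <= 0:
        return 1.0
    used = min(max(spent / budget, 0.0), 1.0)  # fraction of budget consumed, clamped
    return base + (1.0 - base) * used


for spent in (0.0, 0.05, 0.09, 0.10):
    print(f"spent ${spent:.2f} -> threshold {effective_threshold(0.80, spent, 0.10):.2f}")
```

With a rule like this, a task scoring 0.85 would reach the big model early in a session but be sent to the cheap model once roughly a quarter of the budget is gone.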
Telemetry — inspect spend, routing splits, and reject rates:
baar-telemetry telemetry.jsonl
Resilience testing — adversarial scenarios (complexity games, tight budgets, padding attacks):
baar-stress
Security
Baar-Core maps to OWASP LLM10:2025 — Unbounded Consumption. The pre-flight kill-switch is a direct mitigation for Denial-of-Wallet attacks: even if an adversary crafts a prompt designed to trigger expensive model calls, the local budget cap catches it before any provider request is made.
Details: RESEARCH.md
License
MIT — LICENSE
File details
Details for the file baar_core-0.3.0.tar.gz.
File metadata
- Download URL: baar_core-0.3.0.tar.gz
- Upload date:
- Size: 53.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4c3dd88f72d2bef1e057a15d8495f854018bc655dd7fc88bba1a77d4c3466d11 |
| MD5 | 60fa6e44b522097344c540255db53089 |
| BLAKE2b-256 | ba96c3a5f5f582d569be13009dc8073abddc51b7a275fc2eb01df1c5ba8c9441 |
File details
Details for the file baar_core-0.3.0-py3-none-any.whl.
File metadata
- Download URL: baar_core-0.3.0-py3-none-any.whl
- Upload date:
- Size: 35.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb10a1524f82196b65ed725ceb72f737864deed7bd6e1211c091541989c2df5e |
| MD5 | 342e0b49d21f16261d8e32dae94dba57 |
| BLAKE2b-256 | 315643e730fa657e4786df6d1759ded002427338a50a6434a5eb33e3085e0e1e |