Intelligent LLM agent cost optimization runtime — semantic caching + graduated budget enforcement

These details have not been verified by PyPI

Project links

Project description

AgentFuse

Intelligent LLM Agent Cost Optimization Runtime

PyPI License Python Tests Coverage

AgentFuse is a production-grade Python SDK that optimizes LLM costs through intelligent model routing, semantic caching, graduated budget enforcement, and unified observability across OpenAI, Anthropic, Google Gemini, DeepSeek, Mistral, and 12+ providers. Built with insights from LiteLLM, Portkey, and Helicone architectures, and backed by research from 8 academic papers.

The Problem

AI agents burn money without warning. A stuck loop can cost $50 in 10 minutes. Retries on a failed call can triple your bill. There is no built-in way to enforce per-run budgets across LLM providers.

The Solution

AgentFuse intercepts every LLM call with a two-tier semantic cache (Redis L1 exact-match + FAISS L2 vector similarity) achieving 87.5% hit rate, and enforces graduated budget policies that downgrade models, compress context, and terminate gracefully instead of burning your budget.

Key Numbers

Metric	Value
Cache hit rate	87.5% on repeated and paraphrased prompts
Cost reduction	71.8% ($0.24 vs $0.87 for same workload)
Model routing savings	Up to 85% via intelligent complexity routing (RouteLLM-inspired)
Tokens saved	179,445 per 100 calls
Integration effort	1 line of code (`completion()` gateway)
Test suite	1080 unit tests, 86% core coverage
Models supported	30+ with hot-reloadable pricing (GPT-5, Claude Opus 4.6, Gemini 2.5 Pro)
Providers supported	12 (OpenAI, Anthropic, Gemini, DeepSeek, Mistral, Groq, Together, xAI, Fireworks, OpenRouter, Ollama, vLLM)
Production subsystems	32 (cache, budget, routing, retry, dedup, alerting, anomaly detection, predictive routing, prompt compression, tool cost tracking, conversation estimation, hierarchical budgets)

Quickstart

pip install agentfuse-runtime

from agentfuse import completion

# One function for ANY provider — budget-enforced, cached, cost-tracked
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    budget_id="my_agent",
    budget_usd=5.00,
)

# Works with ANY provider — just change the model name:
response = completion(model="claude-sonnet-4-6", messages=[...], budget_id="run_2", budget_usd=3.00)
response = completion(model="gemini-2.5-pro", messages=[...], budget_id="run_3", budget_usd=1.00)
response = completion(model="deepseek/deepseek-chat", messages=[...])

# Enable intelligent routing — sends simple queries to cheap models automatically:
response = completion(model="gpt-4o", messages=[...], auto_route=True)

Session API — all-in-one context manager (recommended):

from agentfuse import AgentSession

with AgentSession("my_agent", budget_usd=5.00) as session:
    response = session.completion(messages=[{"role": "user", "content": "What is Python?"}])
    session.record_tool_call("web_search", cost=0.01)  # track tool costs too

receipt = session.get_receipt()
# {'total_cost_usd': 0.52, 'llm_cost_usd': 0.51, 'tool_cost_usd': 0.01, 'cache_hit_rate': 0.3, ...}

Check your spend (persists across restarts):

from agentfuse import get_spend_report
report = get_spend_report()
# {'total_usd': 4.52, 'by_model': {'gpt-4o': 3.1, 'claude-sonnet-4-6': 1.42}, ...}

Legacy monkey-patch API (still supported):

from agentfuse import wrap_openai
import openai

wrap_openai(budget_usd=5.00, run_id="my_agent")
# All subsequent openai.chat.completions.create() calls are now
# budget-enforced and semantically cached

Architecture

Request → Cache Lookup (L1 Redis → L2 FAISS) → Budget Check → LLM Call → Cost Recording → Cache Store
                                                    │
                                          ┌─────────┼─────────┐
                                          │         │         │
                                       60%: Alert  80%: Downgrade  90%: Compress  100%: Terminate

Graduated Budget Policies:

60% spent — alert callback triggered
80% spent — automatic model downgrade (e.g., GPT-4o to GPT-4o-mini, Claude Opus to Sonnet)
90% spent — context compression (system prompt + last 6 messages) and model downgrade
100% spent — graceful termination with partial results preserved

Features

Budget Enforcement

Per-run budget limits with graduated degradation policies
Atomic Redis Lua budget enforcement with microdollar precision (integer arithmetic, no floating-point drift)
In-memory budget store with TTL and LRU eviction for zero-dependency deployments
Async-safe budget operations via asyncio.Lock
Anthropic overflow pricing: automatic 2x input / 1.5x output when requests exceed 200K input tokens
Gemini Pro overflow pricing: automatic 2x when requests exceed 200K input tokens

Two-Tier Semantic Cache

L1 (Redis): SHA-256 exact-match lookup, sub-millisecond latency. Falls back to local TTLCache when Redis is unavailable
L2 (FAISS): IndexFlatIP vector similarity search with redis/langcache-embed-v2 embeddings (768-dim, purpose-built for semantic caching), 2-5ms latency
Cross-model contamination prevention: model name in L1 hash key, model prefix post-filter on L2 results
Tool-use queries never enter L2 semantic search (tool arguments are context-dependent)
Requests with temperature > 0.5 or side-effect tools (send_email, execute_trade, etc.) skip cache entirely
24-hour TTL with +/-10% jitter to prevent thundering herd
FAISS index persistence to disk (save_l2_index / load_l2_index)
LRU eviction at configurable max entries with full FAISS index rebuild

Token Counting

Provider-aware 4-tier fallback chain:
- Tier 1: Exact local tokenizer — GPT-4o/GPT-4.1/o1/o3/o4 use o200k_base, GPT-4/GPT-3.5 use cl100k_base
- Tier 2: Provider API counting (Anthropic/Gemini — planned)
- Tier 3: tiktoken cl100k_base with safety margins — Anthropic 1.20x, Gemini 1.25x, Mistral 1.15x, DeepSeek/Llama 1.10x, Grok 1.15x
- Tier 4: Character-based estimate (len(text) / 3.5) for unknown models
Multimodal content block handling (Anthropic vision format)

Model Registry

Hot-reloadable pricing for 30+ models at March 2026 rates (GPT-5, GPT-4.1, o3, o4-mini, Claude Sonnet/Haiku/Opus 4.x, Gemini 2.x, DeepSeek V3/R1, Mistral, Grok, Llama)
Per-model-family cache discounts: GPT-5 (90%), GPT-4.1 (75%), GPT-4o (50%), Gemini (90%)
4-tier pricing lookup: user overrides, exact match, fine-tuned model (2x base price), unknown (zero + warning)
LiteLLM remote refresh for automatic pricing updates
cached_input_cost() for provider cache discounts (Anthropic 90% off, DeepSeek 90% off)
Configurable refresh interval via AGENTFUSE_REGISTRY_REFRESH_HOURS environment variable

Provider Router

Automatic provider detection from model name with base URL routing for OpenAI-compatible providers
12 providers: OpenAI, Anthropic (native SDK), Gemini, DeepSeek, Mistral, Groq, Together AI, xAI, Fireworks AI, OpenRouter, Ollama (local), vLLM (self-hosted)
Fine-tuned model routing: ft:gpt-4o:org:name resolves to OpenAI
list_providers() to enumerate all configured providers

Unified Error Handling

classify_error() across OpenAI, Anthropic, Google GenAI, and httpx exceptions
Provider-specific handling:
- OpenAI insufficient_quota (429) is not retryable (billing issue, not rate limit)
- Anthropic OverloadedError (529) is always retryable
- Context window exceeded (400) is not retryable
Retry-After header extraction from provider responses
agentfuse_retry() decorator with tenacity exponential backoff
Cost-aware retry with automatic model downgrade on each retry attempt

Security

API key protection: mask_api_key() for safe logging, validate_api_key_format() per provider
Prompt injection detection: check_prompt_injection() flags known injection patterns
Invisible character stripping: Removes zero-width Unicode chars used in steganographic attacks
Response safety validation: Prevents caching XSS, javascript: URIs, and data: URI payloads
Per-tenant cache isolation: CacheAttack defense (arXiv 2601.23088) — no cross-tenant cache access
Dual-threshold verification: Higher similarity required for cache writes (0.95) than reads (0.90)
Input validation: Fail-fast on malformed inputs at gateway boundary
0 known vulnerabilities: Clean pip-audit on all dependencies
No dangerous patterns: No eval(), exec(), subprocess, pickle in codebase

Reliability

Semantic loop detection with FAISS sliding window and cost-aware thresholds
Streaming cost middleware with real-time per-token cost tracking and abort capability
Anthropic prompt caching middleware — auto-injects cache_control markers on static system messages above model-specific thresholds
Structured JSON cost receipts with per-step logging (model, tokens, cost, cache tier, latency)
Automatic model fallback on retryable errors (tries cheaper models from DEFAULT_CHAINS)
Request deduplication — coalesces identical in-flight requests to avoid duplicate API calls
GCRA rate limiting — smooth traffic shaping per tenant

Framework Integrations

Framework	Integration	Cache Support
OpenAI	`wrap_openai()` monkey-patch	Full (intercept + return cached)
Anthropic	`wrap_anthropic()` monkey-patch	Full (intercept + return cached)
LangChain	`AgentFuseChatModel` (BaseChatModel wrapper)	Full (wrapper checks cache before delegating)
CrewAI	`create_agentfuse_hooks()` (before/after hooks)	Full (before hook blocks call on cache hit)
OpenAI Agents SDK	`AgentFuseModel` / `AgentFuseModelProvider`	Full (async cache check in `get_response`)

Observability

OpenTelemetry: GenAI semantic convention v1.40 spans with gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model attributes
Structured Logging: structlog JSON output with automatic OTel trace/span ID injection and Datadog dd.trace_id / dd.span_id correlation
Prometheus Metrics:
- gen_ai_client_token_usage — histogram, tokens per operation
- gen_ai_client_operation_duration_seconds — histogram, latency
- agentfuse_cache_hits_total / agentfuse_cache_lookups_total — counters by model and tier
- agentfuse_cost_usd_total — counter by model and provider
- agentfuse_cost_per_request_usd — histogram with buckets [0.0001 ... 10.0]
- agentfuse_budget_remaining_usd — gauge per budget
- agentfuse_errors_total — counter by error type and provider
- agentfuse_model_fallbacks_total — counter by original and fallback model
All observability calls wrapped in try/except — failures never propagate to user code

Usage Normalization

Unified NormalizedUsage dataclass across all providers
Anthropic fix: input_tokens excludes cached tokens — AgentFuse adds cache_read_input_tokens + cache_creation_input_tokens for correct total
OpenAI: completion_tokens already includes reasoning_tokens — no double-counting
Gemini: thoughts_token_count billed as output — added to candidates_token_count

Integration Examples

OpenAI

from agentfuse import wrap_openai
wrap_openai(budget_usd=5.00, run_id="my_agent")

Anthropic

from agentfuse import wrap_anthropic
wrap_anthropic(budget_usd=5.00, run_id="my_agent")

LangChain

from agentfuse.integrations.langchain import AgentFuseChatModel
model = AgentFuseChatModel(inner=ChatOpenAI(), budget=5.00)
response = model.invoke([HumanMessage(content="Hello")])

CrewAI

from agentfuse.integrations.crewai import create_agentfuse_hooks
before_hook, after_hook = create_agentfuse_hooks(budget=5.00)

OpenAI Agents SDK

from agentfuse.integrations.openai_agents import AgentFuseModelProvider
provider = AgentFuseModelProvider(inner=your_provider, budget=5.00)
result = await Runner.run(agent, run_config=RunConfig(model_provider=provider))

Programmatic Usage

from agentfuse import BudgetEngine, ModelPricingEngine, TokenCounterAdapter

engine = BudgetEngine("run_123", budget_usd=5.00, model="gpt-4o")
pricing = ModelPricingEngine()
tokenizer = TokenCounterAdapter()

messages = [{"role": "user", "content": "What is the capital of France?"}]
tokens = tokenizer.count_messages(messages, "gpt-4o")
estimated_cost = pricing.input_cost("gpt-4o", tokens)

result_messages, active_model = engine.check_and_act(estimated_cost, messages)
# active_model may be downgraded if budget is running low

Comparison

Feature	AgentFuse	AgentBudget	LiteLLM	Portkey
Per-run budget enforcement	Yes	Yes	No	No
Semantic caching (87.5% hit rate)	Yes	No	No	Basic
Two-tier cache (Redis + FAISS)	Yes	No	No	No
Cross-model contamination prevention	Yes	No	No	No
Mid-run model switching (at 80% budget)	Yes	No	No	No
Context compression (at 90% budget)	Yes	No	No	No
Graceful termination + partial results	Yes	No	No	No
Semantic loop detection	Yes	No	No	No
Retry storm prevention with model downgrade	Yes	No	Yes	Yes
Streaming cost abort	Yes	No	No	No
Auto Anthropic prompt caching	Yes	No	No	No
Unified error classifier (3+ providers)	Yes	No	Partial	No
Anthropic/Gemini overflow pricing	Yes	No	No	No
OTel GenAI spans + Prometheus metrics	Yes	No	Yes	Yes
Structured cost receipts (JSON)	Yes	No	Basic	Basic
LangChain / CrewAI / OpenAI Agents SDK	Yes	No	Yes	Yes
Hot-reloadable model pricing (22+ models)	Yes	No	Yes	No
Atomic Redis Lua budget enforcement	Yes	No	No	No
Provider-aware token counting (6 providers)	Yes	No	Partial	No
FAISS index persistence	Yes	No	No	No

Install

pip install agentfuse-runtime              # Core (in-memory cache + budget)
pip install agentfuse-runtime[redis]       # + Redis cache and budget store
pip install agentfuse-runtime[otel]        # + OpenTelemetry tracing
pip install agentfuse-runtime[openai]      # + OpenAI SDK
pip install agentfuse-runtime[anthropic]   # + Anthropic SDK
pip install agentfuse-runtime[langchain]   # + LangChain Core
pip install agentfuse-runtime[all]         # Everything

Requirements: Python 3.11+

Changelog

v0.2.1 — Production Fixes (March 2026)

Fixed: environment variable rate limiting (AGENTFUSE_RATE_LIMIT_RPS) now works correctly
Fixed: configure(output_guardrails=...) now properly sets module-level guardrails
Fixed: output guardrails are checked before caching responses
Fixed: streaming responses validated before cache storage (both OpenAI and Anthropic)
Fixed: acompletion() now has automatic fallback chain matching sync completion()
Fixed: async provider uses asyncio.get_running_loop() (deprecated get_event_loop removed)
Fixed: async streaming injects stream_options for OpenAI usage tracking
Added: SDK client caching across requests (reuses connection pools)
Added: AgentSession async context manager (async with)
Added: real-world end-to-end test script (examples/e2e_real_test.py)
Added: CI coverage threshold enforcement
Filled: all example files with working code
85 public exports, 1080 tests passing

v0.2.0 — Production Rebuild (March 2026)

New Modules:

TwoTierCacheMiddleware — Redis L1 exact-match + FAISS L2 semantic search with redis/langcache-embed-v2
ModelRegistry — hot-reloadable pricing for 22+ models with LiteLLM remote refresh
ProviderRouter — automatic provider detection and base URL routing for 12 providers
RedisBudgetStore — atomic budget enforcement via Redis Lua scripts with microdollar precision
InMemoryBudgetStore / AsyncInMemoryBudgetStore — thread-safe budget stores with TTL and LRU
NormalizedUsage / extract_usage() — unified token usage extraction across all providers
classify_error() / ClassifiedError — unified error classification with Retry-After extraction
agentfuse_retry() — tenacity-based retry decorator with provider-aware predicate
OTel GenAI spans, structlog JSON logging, Prometheus metrics (11 metric types)

Framework Integrations (rebuilt):

LangChain: AgentFuseChatModel — BaseChatModel wrapper (callbacks are observe-only and cannot return cached responses)
CrewAI: create_agentfuse_hooks() — before/after hooks with side-channel cached response injection
OpenAI Agents SDK: AgentFuseModel / AgentFuseModelProvider — async Model interface with non-blocking cache store

Bug Fixes:

Cache key design: SHA-256 with model always first component — cross-model contamination eliminated
Token counting: GPT-4o/GPT-4.1/o-series use o200k_base (was incorrectly using cl100k_base)
Safety margins: Anthropic 1.20x (was 1.15x), Gemini 1.25x (was 1.05x) — prevents budget underrun
Thread safety: instance-level locks + ContextVar for per-run isolation (was using class-level locks causing serialization)
L2 cache eviction: vectors stored alongside metadata for correct FAISS index rebuild (was losing all vectors)
90% budget policy: now correctly applies both model downgrade and context compression
Retry module: uses classify_error() instead of string matching on exception names
L2 metadata filter: correct prefix detection for Mistral, Grok, Llama models
Double-wrapping prevention in wrap_openai() / wrap_anthropic()

Pricing:

Anthropic overflow pricing (>200K input tokens: 2x input, 1.5x output)
Gemini Pro overflow pricing (>200K input tokens: 2x)
Cached input cost method for provider cache discounts (Anthropic 90%, DeepSeek 90%)

Quality:

260 unit tests, 0 construction-only tests, 86% core module coverage
PEP 561 py.typed marker for type checking support

v0.1.0 — Initial Prototype (February 2026)

BudgetEngine with graduated policies, CacheMiddleware with Intent Atoms FAISS 3-tier cache
wrap_openai(), wrap_anthropic() monkey-patches
LoopDetectionMiddleware, CostAwareRetry, StreamingCostMiddleware, PromptCachingMiddleware, CostReceiptEmitter
LangChain, CrewAI, OpenAI Agents SDK integrations (callback-based)
34 tests passing

Roadmap

Python SDK v0.2.0 — production-ready
TypeScript SDK
Cloud dashboard with real-time cost monitoring
Anomaly detection (per-agent-type baseline profiling)
Batch pricing support (50% discount with 24h SLA)
Multi-agent budget sharing

Contributing

git clone https://github.com/vinaybudideti/agentfuse.git
cd agentfuse
pip install -e ".[all]"
pytest tests/unit/ -v

License

MIT

Built by @vinaybudideti

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.2

Mar 29, 2026

This version

0.2.1

Mar 29, 2026

0.0.1

Mar 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentfuse_runtime-0.2.1.tar.gz (166.7 kB view details)

Uploaded Mar 29, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

agentfuse_runtime-0.2.1-py3-none-any.whl (204.4 kB view details)

Uploaded Mar 29, 2026 Python 3

File details

Details for the file agentfuse_runtime-0.2.1.tar.gz.

File metadata

Download URL: agentfuse_runtime-0.2.1.tar.gz
Upload date: Mar 29, 2026
Size: 166.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for agentfuse_runtime-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`48df20bd5b6936505d4b8a5960319e269326024610e1fde8239eeefc20017cd3`
MD5	`6d6b8c74d8941001531f882ce11f5ddc`
BLAKE2b-256	`4e9786d4d371a457a19510e15db6516338f386ec43df7b194689fb1596b2bf04`

See more details on using hashes here.

File details

Details for the file agentfuse_runtime-0.2.1-py3-none-any.whl.

File metadata

Download URL: agentfuse_runtime-0.2.1-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 204.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for agentfuse_runtime-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9230a9a7ae8a258202655d2063538d003f49fec16dc58c9cb7e2d4f767a2de51`
MD5	`b93f477a7fb3075ac588ee6b267430e9`
BLAKE2b-256	`00fbf01fa2adee8df9d179d71d966999d1b98ffb65d8a5391d8b6cfd0b024746`

See more details on using hashes here.

agentfuse-runtime 0.2.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

AgentFuse

The Problem

The Solution

Key Numbers

Quickstart

Architecture

Features

Budget Enforcement

Two-Tier Semantic Cache

Token Counting

Model Registry

Provider Router

Unified Error Handling

Security

Reliability

Framework Integrations

Observability

Usage Normalization

Integration Examples

OpenAI

Anthropic

LangChain

CrewAI

OpenAI Agents SDK

Programmatic Usage

Comparison

Install

Changelog

v0.2.1 — Production Fixes (March 2026)

v0.2.0 — Production Rebuild (March 2026)

v0.1.0 — Initial Prototype (February 2026)

Roadmap

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes