Skip to main content

Intelligent LLM agent cost optimization runtime — semantic caching + graduated budget enforcement

Project description

AgentFuse

Intelligent LLM Agent Cost Optimization Runtime

PyPI License Python Tests Coverage

AgentFuse is a production-grade Python SDK that optimizes LLM costs through intelligent model routing, semantic caching, graduated budget enforcement, and unified observability across OpenAI, Anthropic, Google Gemini, DeepSeek, Mistral, and 12+ providers. Built with insights from LiteLLM, Portkey, and Helicone architectures, and backed by research from 8 academic papers.

The Problem

AI agents burn money without warning. A stuck loop can cost $50 in 10 minutes. Retries on a failed call can triple your bill. There is no built-in way to enforce per-run budgets across LLM providers.

The Solution

AgentFuse intercepts every LLM call with a two-tier semantic cache (Redis L1 exact-match + FAISS L2 vector similarity) achieving 87.5% hit rate, and enforces graduated budget policies that downgrade models, compress context, and terminate gracefully instead of burning your budget.

Key Numbers

Metric Value
Cache hit rate 87.5% on repeated and paraphrased prompts
Cost reduction 71.8% ($0.24 vs $0.87 for same workload)
Model routing savings Up to 85% via intelligent complexity routing (RouteLLM-inspired)
Tokens saved 179,445 per 100 calls
Integration effort 1 line of code (completion() gateway)
Test suite 1080 unit tests, 86% core coverage
Models supported 30+ with hot-reloadable pricing (GPT-5, Claude Opus 4.6, Gemini 2.5 Pro)
Providers supported 12 (OpenAI, Anthropic, Gemini, DeepSeek, Mistral, Groq, Together, xAI, Fireworks, OpenRouter, Ollama, vLLM)
Production subsystems 32 (cache, budget, routing, retry, dedup, alerting, anomaly detection, predictive routing, prompt compression, tool cost tracking, conversation estimation, hierarchical budgets)

Quickstart

pip install agentfuse-runtime
from agentfuse import completion

# One function for ANY provider — budget-enforced, cached, cost-tracked
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    budget_id="my_agent",
    budget_usd=5.00,
)

# Works with ANY provider — just change the model name:
response = completion(model="claude-sonnet-4-6", messages=[...], budget_id="run_2", budget_usd=3.00)
response = completion(model="gemini-2.5-pro", messages=[...], budget_id="run_3", budget_usd=1.00)
response = completion(model="deepseek/deepseek-chat", messages=[...])

# Enable intelligent routing — sends simple queries to cheap models automatically:
response = completion(model="gpt-4o", messages=[...], auto_route=True)

Session API — all-in-one context manager (recommended):

from agentfuse import AgentSession

with AgentSession("my_agent", budget_usd=5.00) as session:
    response = session.completion(messages=[{"role": "user", "content": "What is Python?"}])
    session.record_tool_call("web_search", cost=0.01)  # track tool costs too

receipt = session.get_receipt()
# {'total_cost_usd': 0.52, 'llm_cost_usd': 0.51, 'tool_cost_usd': 0.01, 'cache_hit_rate': 0.3, ...}

Check your spend (persists across restarts):

from agentfuse import get_spend_report
report = get_spend_report()
# {'total_usd': 4.52, 'by_model': {'gpt-4o': 3.1, 'claude-sonnet-4-6': 1.42}, ...}

Legacy monkey-patch API (still supported):

from agentfuse import wrap_openai
import openai

wrap_openai(budget_usd=5.00, run_id="my_agent")
# All subsequent openai.chat.completions.create() calls are now
# budget-enforced and semantically cached

Architecture

Request → Cache Lookup (L1 Redis → L2 FAISS) → Budget Check → LLM Call → Cost Recording → Cache Store
                                                    │
                                          ┌─────────┼─────────┐
                                          │         │         │
                                       60%: Alert  80%: Downgrade  90%: Compress  100%: Terminate

Graduated Budget Policies:

  • 60% spent — alert callback triggered
  • 80% spent — automatic model downgrade (e.g., GPT-4o to GPT-4o-mini, Claude Opus to Sonnet)
  • 90% spent — context compression (system prompt + last 6 messages) and model downgrade
  • 100% spent — graceful termination with partial results preserved

Features

Budget Enforcement

  • Per-run budget limits with graduated degradation policies
  • Atomic Redis Lua budget enforcement with microdollar precision (integer arithmetic, no floating-point drift)
  • In-memory budget store with TTL and LRU eviction for zero-dependency deployments
  • Async-safe budget operations via asyncio.Lock
  • Anthropic overflow pricing: automatic 2x input / 1.5x output when requests exceed 200K input tokens
  • Gemini Pro overflow pricing: automatic 2x when requests exceed 200K input tokens

Two-Tier Semantic Cache

  • L1 (Redis): SHA-256 exact-match lookup, sub-millisecond latency. Falls back to local TTLCache when Redis is unavailable
  • L2 (FAISS): IndexFlatIP vector similarity search with redis/langcache-embed-v2 embeddings (768-dim, purpose-built for semantic caching), 2-5ms latency
  • Cross-model contamination prevention: model name in L1 hash key, model prefix post-filter on L2 results
  • Tool-use queries never enter L2 semantic search (tool arguments are context-dependent)
  • Requests with temperature > 0.5 or side-effect tools (send_email, execute_trade, etc.) skip cache entirely
  • 24-hour TTL with +/-10% jitter to prevent thundering herd
  • FAISS index persistence to disk (save_l2_index / load_l2_index)
  • LRU eviction at configurable max entries with full FAISS index rebuild

Token Counting

  • Provider-aware 4-tier fallback chain:
    • Tier 1: Exact local tokenizer — GPT-4o/GPT-4.1/o1/o3/o4 use o200k_base, GPT-4/GPT-3.5 use cl100k_base
    • Tier 2: Provider API counting (Anthropic/Gemini — planned)
    • Tier 3: tiktoken cl100k_base with safety margins — Anthropic 1.20x, Gemini 1.25x, Mistral 1.15x, DeepSeek/Llama 1.10x, Grok 1.15x
    • Tier 4: Character-based estimate (len(text) / 3.5) for unknown models
  • Multimodal content block handling (Anthropic vision format)

Model Registry

  • Hot-reloadable pricing for 30+ models at March 2026 rates (GPT-5, GPT-4.1, o3, o4-mini, Claude Sonnet/Haiku/Opus 4.x, Gemini 2.x, DeepSeek V3/R1, Mistral, Grok, Llama)
  • Per-model-family cache discounts: GPT-5 (90%), GPT-4.1 (75%), GPT-4o (50%), Gemini (90%)
  • 4-tier pricing lookup: user overrides, exact match, fine-tuned model (2x base price), unknown (zero + warning)
  • LiteLLM remote refresh for automatic pricing updates
  • cached_input_cost() for provider cache discounts (Anthropic 90% off, DeepSeek 90% off)
  • Configurable refresh interval via AGENTFUSE_REGISTRY_REFRESH_HOURS environment variable

Provider Router

  • Automatic provider detection from model name with base URL routing for OpenAI-compatible providers
  • 12 providers: OpenAI, Anthropic (native SDK), Gemini, DeepSeek, Mistral, Groq, Together AI, xAI, Fireworks AI, OpenRouter, Ollama (local), vLLM (self-hosted)
  • Fine-tuned model routing: ft:gpt-4o:org:name resolves to OpenAI
  • list_providers() to enumerate all configured providers

Unified Error Handling

  • classify_error() across OpenAI, Anthropic, Google GenAI, and httpx exceptions
  • Provider-specific handling:
    • OpenAI insufficient_quota (429) is not retryable (billing issue, not rate limit)
    • Anthropic OverloadedError (529) is always retryable
    • Context window exceeded (400) is not retryable
  • Retry-After header extraction from provider responses
  • agentfuse_retry() decorator with tenacity exponential backoff
  • Cost-aware retry with automatic model downgrade on each retry attempt

Security

  • API key protection: mask_api_key() for safe logging, validate_api_key_format() per provider
  • Prompt injection detection: check_prompt_injection() flags known injection patterns
  • Invisible character stripping: Removes zero-width Unicode chars used in steganographic attacks
  • Response safety validation: Prevents caching XSS, javascript: URIs, and data: URI payloads
  • Per-tenant cache isolation: CacheAttack defense (arXiv 2601.23088) — no cross-tenant cache access
  • Dual-threshold verification: Higher similarity required for cache writes (0.95) than reads (0.90)
  • Input validation: Fail-fast on malformed inputs at gateway boundary
  • 0 known vulnerabilities: Clean pip-audit on all dependencies
  • No dangerous patterns: No eval(), exec(), subprocess, pickle in codebase

Reliability

  • Semantic loop detection with FAISS sliding window and cost-aware thresholds
  • Streaming cost middleware with real-time per-token cost tracking and abort capability
  • Anthropic prompt caching middleware — auto-injects cache_control markers on static system messages above model-specific thresholds
  • Structured JSON cost receipts with per-step logging (model, tokens, cost, cache tier, latency)
  • Automatic model fallback on retryable errors (tries cheaper models from DEFAULT_CHAINS)
  • Request deduplication — coalesces identical in-flight requests to avoid duplicate API calls
  • GCRA rate limiting — smooth traffic shaping per tenant

Framework Integrations

Framework Integration Cache Support
OpenAI wrap_openai() monkey-patch Full (intercept + return cached)
Anthropic wrap_anthropic() monkey-patch Full (intercept + return cached)
LangChain AgentFuseChatModel (BaseChatModel wrapper) Full (wrapper checks cache before delegating)
CrewAI create_agentfuse_hooks() (before/after hooks) Full (before hook blocks call on cache hit)
OpenAI Agents SDK AgentFuseModel / AgentFuseModelProvider Full (async cache check in get_response)

Observability

  • OpenTelemetry: GenAI semantic convention v1.40 spans with gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model attributes
  • Structured Logging: structlog JSON output with automatic OTel trace/span ID injection and Datadog dd.trace_id / dd.span_id correlation
  • Prometheus Metrics:
    • gen_ai_client_token_usage — histogram, tokens per operation
    • gen_ai_client_operation_duration_seconds — histogram, latency
    • agentfuse_cache_hits_total / agentfuse_cache_lookups_total — counters by model and tier
    • agentfuse_cost_usd_total — counter by model and provider
    • agentfuse_cost_per_request_usd — histogram with buckets [0.0001 ... 10.0]
    • agentfuse_budget_remaining_usd — gauge per budget
    • agentfuse_errors_total — counter by error type and provider
    • agentfuse_model_fallbacks_total — counter by original and fallback model
  • All observability calls wrapped in try/except — failures never propagate to user code

Usage Normalization

  • Unified NormalizedUsage dataclass across all providers
  • Anthropic fix: input_tokens excludes cached tokens — AgentFuse adds cache_read_input_tokens + cache_creation_input_tokens for correct total
  • OpenAI: completion_tokens already includes reasoning_tokens — no double-counting
  • Gemini: thoughts_token_count billed as output — added to candidates_token_count

Integration Examples

OpenAI

from agentfuse import wrap_openai
wrap_openai(budget_usd=5.00, run_id="my_agent")

Anthropic

from agentfuse import wrap_anthropic
wrap_anthropic(budget_usd=5.00, run_id="my_agent")

LangChain

from agentfuse.integrations.langchain import AgentFuseChatModel
model = AgentFuseChatModel(inner=ChatOpenAI(), budget=5.00)
response = model.invoke([HumanMessage(content="Hello")])

CrewAI

from agentfuse.integrations.crewai import create_agentfuse_hooks
before_hook, after_hook = create_agentfuse_hooks(budget=5.00)

OpenAI Agents SDK

from agentfuse.integrations.openai_agents import AgentFuseModelProvider
provider = AgentFuseModelProvider(inner=your_provider, budget=5.00)
result = await Runner.run(agent, run_config=RunConfig(model_provider=provider))

Programmatic Usage

from agentfuse import BudgetEngine, ModelPricingEngine, TokenCounterAdapter

engine = BudgetEngine("run_123", budget_usd=5.00, model="gpt-4o")
pricing = ModelPricingEngine()
tokenizer = TokenCounterAdapter()

messages = [{"role": "user", "content": "What is the capital of France?"}]
tokens = tokenizer.count_messages(messages, "gpt-4o")
estimated_cost = pricing.input_cost("gpt-4o", tokens)

result_messages, active_model = engine.check_and_act(estimated_cost, messages)
# active_model may be downgraded if budget is running low

Comparison

Feature AgentFuse AgentBudget LiteLLM Portkey
Per-run budget enforcement Yes Yes No No
Semantic caching (87.5% hit rate) Yes No No Basic
Two-tier cache (Redis + FAISS) Yes No No No
Cross-model contamination prevention Yes No No No
Mid-run model switching (at 80% budget) Yes No No No
Context compression (at 90% budget) Yes No No No
Graceful termination + partial results Yes No No No
Semantic loop detection Yes No No No
Retry storm prevention with model downgrade Yes No Yes Yes
Streaming cost abort Yes No No No
Auto Anthropic prompt caching Yes No No No
Unified error classifier (3+ providers) Yes No Partial No
Anthropic/Gemini overflow pricing Yes No No No
OTel GenAI spans + Prometheus metrics Yes No Yes Yes
Structured cost receipts (JSON) Yes No Basic Basic
LangChain / CrewAI / OpenAI Agents SDK Yes No Yes Yes
Hot-reloadable model pricing (22+ models) Yes No Yes No
Atomic Redis Lua budget enforcement Yes No No No
Provider-aware token counting (6 providers) Yes No Partial No
FAISS index persistence Yes No No No

Install

pip install agentfuse-runtime              # Core (in-memory cache + budget)
pip install agentfuse-runtime[redis]       # + Redis cache and budget store
pip install agentfuse-runtime[otel]        # + OpenTelemetry tracing
pip install agentfuse-runtime[openai]      # + OpenAI SDK
pip install agentfuse-runtime[anthropic]   # + Anthropic SDK
pip install agentfuse-runtime[langchain]   # + LangChain Core
pip install agentfuse-runtime[all]         # Everything

Requirements: Python 3.11+

Changelog

v0.2.1 — Production Fixes (March 2026)

  • Fixed: environment variable rate limiting (AGENTFUSE_RATE_LIMIT_RPS) now works correctly
  • Fixed: configure(output_guardrails=...) now properly sets module-level guardrails
  • Fixed: output guardrails are checked before caching responses
  • Fixed: streaming responses validated before cache storage (both OpenAI and Anthropic)
  • Fixed: acompletion() now has automatic fallback chain matching sync completion()
  • Fixed: async provider uses asyncio.get_running_loop() (deprecated get_event_loop removed)
  • Fixed: async streaming injects stream_options for OpenAI usage tracking
  • Added: SDK client caching across requests (reuses connection pools)
  • Added: AgentSession async context manager (async with)
  • Added: real-world end-to-end test script (examples/e2e_real_test.py)
  • Added: CI coverage threshold enforcement
  • Filled: all example files with working code
  • 85 public exports, 1080 tests passing

v0.2.0 — Production Rebuild (March 2026)

New Modules:

  • TwoTierCacheMiddleware — Redis L1 exact-match + FAISS L2 semantic search with redis/langcache-embed-v2
  • ModelRegistry — hot-reloadable pricing for 22+ models with LiteLLM remote refresh
  • ProviderRouter — automatic provider detection and base URL routing for 12 providers
  • RedisBudgetStore — atomic budget enforcement via Redis Lua scripts with microdollar precision
  • InMemoryBudgetStore / AsyncInMemoryBudgetStore — thread-safe budget stores with TTL and LRU
  • NormalizedUsage / extract_usage() — unified token usage extraction across all providers
  • classify_error() / ClassifiedError — unified error classification with Retry-After extraction
  • agentfuse_retry() — tenacity-based retry decorator with provider-aware predicate
  • OTel GenAI spans, structlog JSON logging, Prometheus metrics (11 metric types)

Framework Integrations (rebuilt):

  • LangChain: AgentFuseChatModelBaseChatModel wrapper (callbacks are observe-only and cannot return cached responses)
  • CrewAI: create_agentfuse_hooks() — before/after hooks with side-channel cached response injection
  • OpenAI Agents SDK: AgentFuseModel / AgentFuseModelProvider — async Model interface with non-blocking cache store

Bug Fixes:

  • Cache key design: SHA-256 with model always first component — cross-model contamination eliminated
  • Token counting: GPT-4o/GPT-4.1/o-series use o200k_base (was incorrectly using cl100k_base)
  • Safety margins: Anthropic 1.20x (was 1.15x), Gemini 1.25x (was 1.05x) — prevents budget underrun
  • Thread safety: instance-level locks + ContextVar for per-run isolation (was using class-level locks causing serialization)
  • L2 cache eviction: vectors stored alongside metadata for correct FAISS index rebuild (was losing all vectors)
  • 90% budget policy: now correctly applies both model downgrade and context compression
  • Retry module: uses classify_error() instead of string matching on exception names
  • L2 metadata filter: correct prefix detection for Mistral, Grok, Llama models
  • Double-wrapping prevention in wrap_openai() / wrap_anthropic()

Pricing:

  • Anthropic overflow pricing (>200K input tokens: 2x input, 1.5x output)
  • Gemini Pro overflow pricing (>200K input tokens: 2x)
  • Cached input cost method for provider cache discounts (Anthropic 90%, DeepSeek 90%)

Quality:

  • 260 unit tests, 0 construction-only tests, 86% core module coverage
  • PEP 561 py.typed marker for type checking support

v0.1.0 — Initial Prototype (February 2026)

  • BudgetEngine with graduated policies, CacheMiddleware with Intent Atoms FAISS 3-tier cache
  • wrap_openai(), wrap_anthropic() monkey-patches
  • LoopDetectionMiddleware, CostAwareRetry, StreamingCostMiddleware, PromptCachingMiddleware, CostReceiptEmitter
  • LangChain, CrewAI, OpenAI Agents SDK integrations (callback-based)
  • 34 tests passing

Roadmap

  • Python SDK v0.2.0 — production-ready
  • TypeScript SDK
  • Cloud dashboard with real-time cost monitoring
  • Anomaly detection (per-agent-type baseline profiling)
  • Batch pricing support (50% discount with 24h SLA)
  • Multi-agent budget sharing

Contributing

git clone https://github.com/vinaybudideti/agentfuse.git
cd agentfuse
pip install -e ".[all]"
pytest tests/unit/ -v

License

MIT


Built by @vinaybudideti

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentfuse_runtime-0.2.1.tar.gz (166.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

agentfuse_runtime-0.2.1-py3-none-any.whl (204.4 kB view details)

Uploaded Python 3

File details

Details for the file agentfuse_runtime-0.2.1.tar.gz.

File metadata

  • Download URL: agentfuse_runtime-0.2.1.tar.gz
  • Upload date:
  • Size: 166.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for agentfuse_runtime-0.2.1.tar.gz
Algorithm Hash digest
SHA256 48df20bd5b6936505d4b8a5960319e269326024610e1fde8239eeefc20017cd3
MD5 6d6b8c74d8941001531f882ce11f5ddc
BLAKE2b-256 4e9786d4d371a457a19510e15db6516338f386ec43df7b194689fb1596b2bf04

See more details on using hashes here.

File details

Details for the file agentfuse_runtime-0.2.1-py3-none-any.whl.

File metadata

File hashes

Hashes for agentfuse_runtime-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 9230a9a7ae8a258202655d2063538d003f49fec16dc58c9cb7e2d4f767a2de51
MD5 b93f477a7fb3075ac588ee6b267430e9
BLAKE2b-256 00fbf01fa2adee8df9d179d71d966999d1b98ffb65d8a5391d8b6cfd0b024746

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page