
agentprod

Production patterns for indie AI agents — extracted from running a multi-LLM trading agent in production.

agentprod is a small Python library of the four things you reach for once your AI agent leaves your laptop and starts charging your credit card at 3 AM:

Module | What it gives you | Why it exists
Router | Cost-aware model selection (cheapest model that meets the quality bar) | Burning Sonnet on "what is the price?" is how you go bankrupt
Throttle | Async token bucket with jitter + hard timeout | Provider rate limits don't just slow you down, they cascade
retry_call / retry_async | Pattern-based detection of transient failures | LLM SDKs change their exception classes every release; the error string is stable
CostTracker | Per-call USD ledger with arbitrary labels (agent, user, route) | "Which agent burned $40 last night" is a question the provider dashboard can't answer

No hard dependency on LangChain / LangGraph / OpenAI SDK. Bring your own LLM client. agentprod just gives you the production scaffolding around it.


Status

Alpha (v0.0.1). APIs may change before 1.0. Battle-tested in one production system; tests cover the core paths but the public surface is intentionally small until usage shapes it.


Install

# Pure stdlib — no required deps
pip install agentprod

# With tenacity for richer retry semantics
pip install "agentprod[retry]"

Python 3.10+.


Quickstart

The full example is in examples/quickstart.py. Skeleton:

import asyncio
from agentprod import (
    Complexity, Router, Throttle, retry_async,
    CostTracker, ModelPricing,
)

router = Router(model_for={
    Complexity.SIMPLE:   "gpt-4o-mini",
    Complexity.MODERATE: "gpt-4o",
    Complexity.COMPLEX:  "claude-sonnet-4-6",
})
throttle = Throttle(capacity=10, refill_per_sec=10)
PRICING = {
    "gpt-4o-mini": ModelPricing(input_per_1k=0.00015, output_per_1k=0.0006),
    "gpt-4o":      ModelPricing(input_per_1k=0.0025,  output_per_1k=0.01),
}
cost = CostTracker(jsonl_path=".data/cost.jsonl")

async def handle(query: str, *, agent: str) -> str:
    model = router.select(query)
    await throttle.acquire(timeout=1.0, label=f"llm:{model}")
    # your_llm_call is your own client wrapper returning
    # (text, input_tokens, output_tokens); see the sketch below
    text, in_tok, out_tok = await retry_async(
        lambda: your_llm_call(model, query),
        max_attempts=3,
    )
    cost.record(
        model=model,
        input_tokens=in_tok, output_tokens=out_tok,
        pricing=PRICING[model],
        labels={"agent": agent},
    )
    return text
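
For reference, here is one minimal way your_llm_call could be written against the OpenAI Python SDK. The AsyncOpenAI client, message shape, and usage fields belong to that SDK, not to agentprod; any client that returns the text plus input and output token counts slots in the same way.

# Sketch only: your_llm_call backed by the OpenAI async client.
# Swap in whichever SDK you actually use; agentprod never touches the client.
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def your_llm_call(model: str, query: str) -> tuple[str, int, int]:
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return (
        resp.choices[0].message.content,
        resp.usage.prompt_tokens,
        resp.usage.completion_tokens,
    )

With that defined, asyncio.run(handle("what is the price of AAPL?", agent="demo")) runs the skeleton end to end.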

Each piece in 30 seconds

Router — cost-aware model selection

Pick the cheapest model that can handle the query:

from agentprod import Complexity, Router

router = Router(
    model_for={
        Complexity.SIMPLE:   "gpt-4o-mini",
        Complexity.MODERATE: "gpt-4o",
        Complexity.COMPLEX:  "claude-sonnet-4-6",
    },
    # Optional: bump domain terms to a higher tier
    complex_keywords=("DCF", "valuation", "portfolio"),
    simple_keywords=("price of", "ticker"),
)

router.select("what is the price of AAPL?")
# → "gpt-4o-mini"

router.select("compare AAPL and MSFT cash flow over 5 years")
# → "claude-sonnet-4-6"

Three-tier classifier (simple / moderate / complex) using:

  1. Simple-keyword regex (wins over everything — short queries shouldn't hit the expensive model just because they happen to contain a complex keyword)
  2. Complex-keyword count
  3. Word-count thresholds (CJK width-aware — works on Korean / Japanese / Chinese mixed input)
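
As a rough illustration of that ordering (this is not the Router's actual code, and the keyword sets and word-count thresholds below are invented for the example), the decision fits in a few lines:

import re

from agentprod import Complexity

# Illustrative precedence only: simple keywords first, then complex keywords,
# then length. The real Router also accounts for CJK text, where whitespace
# word counts undercount.
SIMPLE = re.compile(r"price of|ticker", re.IGNORECASE)
COMPLEX_TERMS = ("dcf", "valuation", "portfolio")

def classify(query: str) -> Complexity:
    if SIMPLE.search(query):          # 1. simple keywords win outright
        return Complexity.SIMPLE
    if any(term in query.lower() for term in COMPLEX_TERMS):
        return Complexity.COMPLEX     # 2. a complex keyword upgrades the tier
    words = len(query.split())        # 3. otherwise, length decides
    if words <= 8:
        return Complexity.SIMPLE
    return Complexity.MODERATE if words <= 40 else Complexity.COMPLEX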

Throttle — asyncio token bucket

import logging

from agentprod import Throttle, ThrottleTimeout

log = logging.getLogger(__name__)

bucket = Throttle(
    capacity=12,             # max burst size
    refill_per_sec=12,       # sustained rps
    jitter_ms=(5, 30),       # avoid thundering herd
    on_acquire=lambda r: log.info("throttle wait: %s", r),
)

try:
    await bucket.acquire(timeout=1.0, label="GET /quote")
    # ... make your call ...
except ThrottleTimeout:
    # bucket couldn't free a slot in time — drop and try next cycle
    return None

Why not aiolimiter / asyncio-throttle? Two things:

  • Hard timeout with an explicit exception. If a burst can't drain within your timeout budget, that's usually a signal to drop the request, not to keep waiting.
  • Metrics callback. The on_acquire hook can be sync or async, and exceptions it raises are swallowed, so you can ship throttle waits to your observability stack without wrapping the bucket.
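
To make the first point concrete, here is a bare-bones token bucket with a hard deadline in plain asyncio. It is a sketch of the pattern, not agentprod's Throttle (the sleep interval and jitter bounds are arbitrary), but it shows why a deadline breaks the pile-up: waiters that can't get a slot in time bail out instead of queueing forever.

import asyncio
import random
import time

class BucketTimeout(Exception):
    """Raised when no token frees up before the deadline."""

class MiniBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now

    async def acquire(self, timeout: float) -> None:
        deadline = time.monotonic() + timeout
        while True:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return
            if time.monotonic() >= deadline:
                # Hard stop: dropping here is what keeps a burst from cascading.
                raise BucketTimeout("no token within deadline")
            # Jittered sleep so a pile of waiters doesn't wake in lockstep.
            await asyncio.sleep(random.uniform(0.005, 0.03))

agentprod's Throttle layers the label and on_acquire plumbing on top of the same idea.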

retry — pattern-based transient detection

from agentprod import is_retryable, retry_call, retry_async

# Decision function — drop into any retry library
if is_retryable(exc):
    ...

# Or use the wrapper (uses tenacity if installed, manual backoff otherwise)
result = retry_call(
    lambda: openai_client.chat.completions.create(...),
    max_attempts=3,
)

# Async version
result = await retry_async(
    lambda: anthropic_client.messages.create(...),
    max_attempts=3,
)

Default patterns cover OpenAI / Anthropic / Google / bare httpx error strings: rate limit, 429, 500, 502, 503, overloaded, timeout, server error, too many requests, connection reset.

Why string matching: provider SDKs reshuffle their exception classes every release. The message is the most stable contract.
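
Because is_retryable is just a predicate on the exception, it also drops straight into other retry machinery. With tenacity installed, something along these lines should work; the stop and wait settings and the fetch_completion wrapper are invented for the example:

from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

from agentprod import is_retryable

# Retry only on errors whose message matches the transient patterns above.
@retry(
    retry=retry_if_exception(is_retryable),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=8),
)
def fetch_completion(client, model: str, prompt: str):
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )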

CostTracker — per-call ledger with labels

from agentprod import CostTracker, ModelPricing

pricing = ModelPricing(
    input_per_1k=0.0025,
    output_per_1k=0.01,
    cached_input_per_1k=0.00125,  # optional, for providers with prompt caching
)

tracker = CostTracker(jsonl_path=".data/cost.jsonl")

tracker.record(
    model="gpt-4o",
    input_tokens=1234, output_tokens=567, cached_input_tokens=800,
    pricing=pricing,
    labels={"agent": "fundamental_analyst", "user": "u_123", "route": "/analyze"},
)

tracker.total_usd()                       # 12.4583
tracker.total_usd(where={"user": "u_123"})  # 0.42
tracker.by_label("agent")                  # {"fundamental_analyst": 0.42, ...}
tracker.by_model()                         # {"gpt-4o": 12.4583}

Why bring your own pricing: model prices change weekly. A library that ships its own catalog goes stale fast.
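
The JSONL ledger also pays off after the fact, e.g. from a cron job the next morning. The field names below (usd and labels) are an assumption about the line format rather than a documented schema, so check a line of your own ledger before relying on it:

import json
from collections import defaultdict
from pathlib import Path

# Hypothetical field names ("usd", "labels"); adjust to whatever your
# .data/cost.jsonl lines actually contain.
spend_by_agent: dict[str, float] = defaultdict(float)
for line in Path(".data/cost.jsonl").read_text().splitlines():
    entry = json.loads(line)
    agent = entry.get("labels", {}).get("agent", "unknown")
    spend_by_agent[agent] += entry.get("usd", 0.0)

for agent, usd in sorted(spend_by_agent.items(), key=lambda kv: -kv[1]):
    print(f"{agent:25s} ${usd:.4f}")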


Why these four

These are the four pieces I rebuilt in three different agent codebases before deciding to extract them once. Every production AI agent eventually needs:

  1. Cost discipline at the routing layer. Tracking per-call cost after the fact isn't enough — by the time you see a $300 bill, the spend is sunk. Routing is where the economics start.
  2. Rate-limit resilience that doesn't cascade. A single 429 turns into 50 once your retries pile up. Token bucket + hard timeout breaks the cascade.
  3. Retry that survives SDK upgrades. I've had three OpenAI SDK upgrades break my retry code because the exception classes moved. String matching the message has outlived all of them.
  4. Cost attribution by label, not just total. "We spent $40 last night" is useless. "The fundamental_analyst agent spent $38 on retries against gpt-4o" is fixable.

Everything else in your agent is your business logic and shouldn't live in a library.


Non-goals

  • No LLM client wrapping. Use OpenAI / Anthropic / LangChain / your own. agentprod gives you the scaffolding around the call, not the call itself.
  • No model catalog. Prices change too fast.
  • No vector DB / RAG / evaluation. Different problem domain.
  • No multiprocessing. The Throttle is asyncio-only by design. If you need cross-process throttling, you want a Redis-backed leaky bucket.

Development

git clone https://github.com/whdrnr2583-cmd/agentprod
cd agentprod
pip install -e ".[dev]"
pytest

License

MIT.

