
Paygent

The economic brain for AI agents -- meter costs and enforce guardrails.

Paygent is a Python SDK that auto-instruments LLM API calls to meter per-user costs (including model-level token tracking), enforce spending guardrails, and sync usage data to the Paygent backend. It's the missing runtime enforcement layer for AI agent applications.

Quick Start

pip install paygent

Configure your plans once on the Paygent dashboard, then in your app:

import openai
from paygent import Paygent, paygent_context

pg = Paygent.init(api_key="pg_live_...")

# Wrap LLM calls in paygent_context with the end user's ID.
# Paygent auto-loads the user's plan on first use — no extra setup.
with paygent_context(user_id="user_123"):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )

# Query usage any time
usage = pg.get_usage("user_123")
print(f"Period cost: ${usage.period_cost:.4f}")

No backend? See Local Mode for running fully offline.

Features

  • Auto-instrumentation -- Monkey-patches OpenAI and Anthropic SDKs. The LLM call line itself is unchanged — you just wrap it in paygent_context(user_id=...). Works transparently with most frameworks that route through these SDKs (tested: LangChain, LangGraph, CrewAI).
  • Per-user metering -- Track token consumption per user, per session, per model in real time.
  • Spending guardrails -- Soft gates (warnings) and hard gates (blocks) for period spend, session spend, and per-model token limits.
  • Concurrency-safe -- Two-phase reservation pattern protects against hard-gate overrun when concurrent calls race at a cap boundary.
  • Model-level tracking -- Track and limit tokens per model separately (e.g., 50K GPT-4o + 30K Claude per period).
  • Background sync -- Events sync to the Paygent backend asynchronously without blocking your agent.
  • Local fallback -- Works fully offline with local SQLite. Events queue and sync when the backend is reachable.
  • Fail-open -- Paygent is designed not to break your agent. Every path that intercepts an LLM call is guarded with try/except and falls through to the original call on error.
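The fail-open guarantee can be sketched as a wrapper (illustrative only -- `guard`, `record`, and `LimitExceeded` are stand-ins for the SDK's internal hooks and for PaygentLimitExceeded, not real Paygent names):

```python
import functools

class LimitExceeded(Exception):
    """Stand-in for PaygentLimitExceeded: deliberate hard-gate blocks still propagate."""

def fail_open(original_call, guard, record):
    """Wrap an intercepted SDK method so unexpected metering errors never break it."""
    @functools.wraps(original_call)
    def patched(*args, **kwargs):
        try:
            guard(*args, **kwargs)          # pre-call guard check
        except LimitExceeded:
            raise                           # intentional block: propagate to caller
        except Exception:
            pass                            # fail open: proceed with the call anyway
        response = original_call(*args, **kwargs)
        try:
            record(response)                # post-call metering
        except Exception:
            pass                            # never mask the real response
        return response
    return patched

def broken_guard(*args, **kwargs):
    raise RuntimeError("metering backend down")

# Even with a guard that always errors, the underlying call still succeeds.
call = fail_open(lambda prompt: f"echo:{prompt}", broken_guard, lambda r: None)
print(call("hi"))  # → echo:hi
```

The key design point: only the deliberate hard-gate exception escapes; every other failure in the metering path falls through to the original call.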

Installation

# Core SDK
pip install paygent

# With LangChain support
pip install paygent[langchain]

# With CrewAI support
pip install paygent[crewai]

# Everything
pip install paygent[all]

Usage

Auto-Instrumentation

When Paygent.init() runs, it monkey-patches OpenAI and Anthropic SDK methods. Any subsequent call inside a paygent_context(user_id=...) block is automatically metered and guard-checked. No changes to the LLM call line itself.

import openai
from paygent import Paygent, paygent_context

pg = Paygent.init(api_key="pg_live_...")

with paygent_context(user_id="user_123"):
    # Automatically metered -- nothing else to do
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "..."}],
    )

Frameworks: LangChain, LangGraph, and CrewAI all call the OpenAI/Anthropic SDKs under the hood, so auto-instrumentation covers them with no extra wiring. Wrap framework entry points (e.g. chain.invoke(...)) in paygent_context(...) just like direct SDK calls.

Backend connectivity check at init

When you pass api_key=..., Paygent.init() runs a synchronous health probe against the backend (3-second timeout). The probe distinguishes unreachable from rejected API key and surfaces the result as a Python warning:

from paygent import Paygent, PaygentBackendUnreachable, PaygentAuthInvalid

pg = Paygent.init(api_key="pg_live_...")

# If the backend is unreachable, you'll see on stderr:
#   PaygentBackendUnreachable: Could not reach Paygent backend at
#   https://api.paygent.dev ... SDK is running in OFFLINE mode ...
#
# If the API key is rejected (401/403):
#   PaygentAuthInvalid: Paygent backend at https://... rejected the API key ...

# Programmatic check:
if not pg.backend_reachable:
    # decide how to handle offline mode — fallback UX, alert, etc.
    log.warning("Paygent running without backend — guardrails are local-only")

Fail fast for CI / production startup: pass strict_backend=True to turn the warning into a raised exception:

pg = Paygent.init(api_key="pg_live_...", strict_backend=True)
# → raises PaygentBackendUnreachable / PaygentAuthInvalid if probe fails

Suppress the warning (if you want the silent-offline behavior):

import warnings
warnings.filterwarnings("ignore", category=PaygentBackendUnreachable)

When to call start_session (optional)

The SDK auto-loads a user's session on first use — start_session() is not required. Call it explicitly only when you want to:

  • Pre-warm the cache to avoid the one backend round-trip on first call
  • Supply a plan config inline (useful in local-only mode, or as a fallback in case the backend is unreachable)
  • Fire on_session_start callbacks at a known moment (e.g. at request start)

In connected mode with plans configured on the Paygent backend, you can skip it entirely.

# Pre-warm (optional — just avoids latency on the first call)
pg.start_session("user_123")

Decorator

@pg.track(user_id="user_123")
def handle_request(query):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )

# Dynamic user ID from a function argument
@pg.track(user_id_param="uid")
def handle_request(uid: str, query: str):
    return openai.chat.completions.create(...)

Explicit Wrap

For cases where you prefer explicit per-call control over monkey-patching:

import openai
client = openai.OpenAI()

# Sync: wrap() takes a ZERO-ARG CALLABLE
response = pg.wrap(
    lambda: client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    ),
    user_id="user_123",
    model="gpt-4o",
)

import openai
async_client = openai.AsyncOpenAI()

# Async: awrap() takes an AWAITABLE (the coroutine directly)
response = await pg.awrap(
    async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    ),
    user_id="user_123",
    model="gpt-4o",
)

The model parameter is optional — Paygent extracts it from the response after the call. Note: per-model token-limit checks only apply when model is passed in, since the pre-call guard can't know which model's cap to check until you tell it.

You can also pass session_id, metadata, provider (explicit token extractor), and estimated_input_tokens / estimated_max_tokens (for better reservation sizing under concurrency).

Plan Configuration

Plans are normally configured on the Paygent dashboard/API and fetched by the SDK on session load. You only need to construct a PlanConfig in code for local-only mode (no API key) or as a fallback when the backend is unreachable.

from paygent import PlanConfig, ModelCostRate, ModelLimitConfig

plan_config = PlanConfig(
    max_spend_per_period=49.00,
    max_spend_per_session=5.00,
    soft_gate_at=0.80,      # Warn at 80% of any limit
    hard_gate_at=1.00,      # Block at 100% of any limit
    model_limits={
        "gpt-4o": ModelLimitConfig(max_tokens_per_period=50000),
        "claude-sonnet-4-20250514": ModelLimitConfig(max_tokens_per_period=30000),
    },
    cost_rates={
        "gpt-4o": ModelCostRate(input=0.0025, output=0.01),
        "claude-sonnet-4-20250514": ModelCostRate(input=0.003, output=0.015),
    },
    # Fallback rate for models not listed in cost_rates (opt-in)
    default_cost_rate=ModelCostRate(input=0.002, output=0.008),
    tool_costs={"web_search": 0.05},
    # --- Pre-call projection ---
    # When True, the guard projects (current + estimated) BEFORE the call
    # and blocks if it would overshoot a cap.  Without this, a user at 99%
    # can make a call that lands at 102% before the next guard fires.
    # See "Pre-call estimation" below for the full semantics.
    pre_call_estimate=False,
    pre_call_buffer_tokens=4096,
    # Safety margin applied to reservation estimates under concurrency.
    # Absorbs estimation drift (chars/4 tokenizer approximation, unknown
    # max_tokens, small race windows at cap boundaries).  Only affects the
    # TEMPORARY hold during the await — actual recorded spend is always
    # the real cost from the response.
    reservation_safety_factor=1.2,
)

A note on the limit matrix

Paygent intentionally splits limits across two units:

| Unit | Scope | Field |
| --- | --- | --- |
| Dollars | Plan-wide (session + period) | max_spend_per_period, max_spend_per_session |
| Tokens | Per-model (period only) | model_limits[name].max_tokens_per_period |

Dollars control total cost. Tokens shape per-model behavior. There is no per-model dollar cap and no per-session token cap — these would be restatements of the same two concerns in different units, and mixing them creates ambiguity about which limit bites first under pricing drift.

If you want "$5/mo of Claude," express it as max_tokens_per_period = 5 / cost_rate. If you want per-session rate limiting on a specific model, max_spend_per_session combined with that model's cost rate already throttles it effectively.
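The dollars-to-tokens conversion is simple arithmetic. A sketch using the Claude rates from the PlanConfig example above, with an assumed 70/30 input/output traffic mix (both the helper and the mix are illustrative, not part of the SDK):

```python
def tokens_for_budget(budget_usd, input_rate_per_1k, output_rate_per_1k,
                      input_fraction=0.7):
    """Convert a dollar budget into an approximate per-period token cap.

    Blends input and output rates by the expected traffic mix.
    """
    blended_per_1k = (input_rate_per_1k * input_fraction
                      + output_rate_per_1k * (1 - input_fraction))
    return int(budget_usd / blended_per_1k * 1000)

# "$5/mo of Claude" at input $0.003/1K and output $0.015/1K
cap = tokens_for_budget(5.00, 0.003, 0.015)
print(cap)  # ≈ 757,575 tokens per period
```

Set the result as that model's max_tokens_per_period and the dollar intent is preserved in token units.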

How the guard check evaluates a call

The guard runs three independent checks before every LLM call. Each check is closed under its own unit (dollars vs tokens) — the SDK never compares dollars to tokens.

| # | Check | Compared against | Units |
| --- | --- | --- | --- |
| 1 | Period spend | max_spend_per_period | dollars |
| 2 | Session spend | max_spend_per_session | dollars |
| 3 | Per-model tokens | model_limits[model].max_tokens_per_period | tokens |

Each check computes a percentage pct = projected / limit and decides:

pct >= hard_gate_at  (default 1.00)  →  hard_gate  →  PaygentLimitExceeded raised
pct >= soft_gate_at  (default 0.80)  →  soft_gate  →  callback fires, call proceeds
otherwise                            →  ok         →  call proceeds silently

So soft_gate_at and hard_gate_at are threshold percentages, not strict greater-than checks. A user at exactly 80% of their period cap fires the soft gate; a user at exactly 100% gets blocked.
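As a pure function, the decision rule for one check looks like this (a simplified sketch; the real GuardResult carries more context):

```python
def gate_status(projected, limit, soft_gate_at=0.80, hard_gate_at=1.00):
    """Decide a gate status for one check. Thresholds are inclusive (>=)."""
    if limit is None or limit <= 0:
        return "ok"                 # no configured limit for this dimension
    pct = projected / limit
    if pct >= hard_gate_at:
        return "hard_gate"
    if pct >= soft_gate_at:
        return "soft_gate"
    return "ok"

print(gate_status(4.00, 5.00))   # exactly 80% → soft_gate fires
print(gate_status(5.00, 5.00))   # exactly 100% → hard_gate
print(gate_status(3.99, 5.00))   # below threshold → ok
```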

How the guard picks a violation when multiple dimensions trip

When more than one of the three checks is in violation at the same time, the guard returns the most restrictive one — hard_gate beats soft_gate, and within the same severity the dimension closest to its cap (highest usage_pct) wins. The GuardResult.gate_reason always reflects the single tightest constraint so your callbacks / error messages can be maximally actionable.
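The tie-break can be expressed as a sort key, severity first and then proximity to the cap (the tuple shape here is illustrative, not the SDK's GuardResult):

```python
SEVERITY = {"ok": 0, "soft_gate": 1, "hard_gate": 2}

def tightest(checks):
    """Pick the single most restrictive check result.

    checks: list of (gate_reason, status, usage_pct) tuples -- illustrative shape.
    """
    return max(checks, key=lambda c: (SEVERITY[c[1]], c[2]))

checks = [
    ("total_spend",        "soft_gate", 0.85),
    ("session_spend",      "hard_gate", 1.02),
    ("model_limit:gpt-4o", "soft_gate", 0.91),
]
print(tightest(checks)[0])  # → session_spend (hard_gate beats soft_gate)
```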

Pre-call estimation

pre_call_estimate (default False) is a master switch that controls whether the guard projects this upcoming call into the check.

  • When False: The guard checks current_usage against limits. A user at 99% gets status="ok", the call fires, and may land at 102% — silent overrun until the next guard fires.
  • When True: The guard checks current_usage + projected_call against limits. The same user at 99% sees the projected 102% and gets hard-gated before the call, preventing the overrun.

The projection adds two numbers per check:

projected_cost   = current.cost   + reserved_cost   + estimated_cost
projected_tokens = current.tokens + reserved_tokens + total_est

where total_est = input_est + output_est:

  • input_est — len(prompt_chars) // 4, a rough chars-per-token heuristic.
  • output_est — your max_tokens kwarg if set, otherwise pre_call_buffer_tokens as a fallback.

pre_call_buffer_tokens (default 4096) is purely internal to Paygent's projection — it is never injected into your actual LLM call. If you don't pass max_tokens to chat.completions.create(), the LLM still generates unbounded output up to the model's context limit. The buffer only tells the guard "assume up to 4096 output tokens for the cap-projection math." If you want bounded output, set max_tokens yourself in the LLM call.

pre_call_buffer_tokens does nothing when pre_call_estimate=False — the field is read only inside the projection path.
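Putting the pieces together, the token-side projection amounts to the following (a simplified sketch mirroring the formulas above; the real guard also projects cost):

```python
def project_tokens(current_tokens, reserved_tokens, prompt_chars,
                   max_tokens=None, pre_call_buffer_tokens=4096):
    """Estimate total tokens after the upcoming call (guard-side projection only)."""
    input_est = prompt_chars // 4                 # rough chars-per-token heuristic
    output_est = max_tokens if max_tokens is not None else pre_call_buffer_tokens
    return current_tokens + reserved_tokens + input_est + output_est

# A user at 49,000 of a 50,000-token cap, sending an 800-char prompt
projected = project_tokens(49_000, 0, 800)    # no max_tokens → buffer applies
print(projected, projected >= 50_000)         # 53296 True → hard-gated before the call
```

With pre_call_estimate=False the guard would compare 49,000 against the cap and let the call through; with it on, the projected 53,296 trips the hard gate first.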

When to enable pre_call_estimate:

  • ✅ Hard caps must never be crossed (regulated billing, prepaid plans).
  • ✅ Per-call cost is non-trivial relative to the cap (a single GPT-4 call can move you 5–10% of a tight budget).
  • ❌ Caps are loose and you'd rather avoid false positives at the boundary.
  • ❌ You always pass max_tokens AND your caps are large vs typical call cost — current-state checking is enough.

Tuning pre_call_buffer_tokens: too low → guard under-projects → calls that omit max_tokens slip past the cap. Too high → false-positive blocks for users with small calls. 4096 covers most gpt-4o-mini/Claude responses; bump to 8192/16384 for long-form generation workloads.

Guardrails

from paygent import PaygentLimitExceeded

# Register soft gate callback (approaching a limit)
def on_approaching_limit(result):
    print(f"Warning: {result.message}")
    # result.gate_reason: "total_spend", "session_spend", "model_limit:gpt-4o"

pg.on_soft_gate(on_approaching_limit)

# Register hard gate callback (fires before the exception is raised)
def on_limit_hit(result):
    log.error(f"Blocked: {result.message}")
    notify_user(result.gate_reason)

pg.on_hard_gate(on_limit_hit)

# Hard gates raise PaygentLimitExceeded
try:
    with paygent_context(user_id="user_123"):
        response = openai.chat.completions.create(...)
except PaygentLimitExceeded as e:
    print(f"Blocked: {e.guard_result.message}")

# Pre-flight check
guard = pg.check_guard("user_123", model="gpt-4o")
if guard.status == "hard_gate":
    print("User has exceeded their limit")

# Size max_tokens safely before the call — especially useful for streaming
# or any scenario where you want to bound output to what the user can afford.
advice = pg.get_max_tokens(
    "user_123",
    model="gpt-4o-mini",
    messages=my_messages,  # Paygent estimates input tokens from this
)
if advice.max_tokens == 0:
    return f"Budget exhausted: {advice.binding_limit}"
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=my_messages,
    max_tokens=advice.max_tokens,  # never pushes the user past any limit
)

Event Callbacks

# Called after every successfully metered LLM call
def on_usage(event):
    print(f"{event.model}: {event.total_tokens} tokens, ${event.cost_total:.4f}")

pg.on_usage(on_usage)

# Called when a user's session is first loaded (from backend / snapshot /
# permissive defaults)
def on_session(session):
    print(f"Session: {session.user_id} on plan {session.plan}")

pg.on_session_start(on_session)

Usage Queries

# Period + session totals (snapshot, auto-loads if not cached)
usage = pg.get_usage("user_123")
print(f"Period cost: ${usage.period_cost:.2f}")
print(f"Session cost: ${usage.session_cost:.2f}")
print(f"Period tokens: {usage.period_tokens_total}")

# Per-model breakdown
for m in pg.get_model_usage("user_123"):
    limit = f"/ {m.tokens_limit}" if m.tokens_limit else ""
    print(f"  {m.model}: {m.tokens_used} tokens {limit}, ${m.cost:.4f}")

# Multi-dimensional remaining budget — spend caps + per-model token caps.
# Dimensions with no configured limit are reported as float('inf') for
# spend fields or None for per-model token fields.
budget = pg.get_remaining_budget("user_123")
print(f"Most constrained: {budget.most_constrained}")
if budget.period_spend_remaining != float("inf"):
    print(f"Period remaining: ${budget.period_spend_remaining:.2f}")

# Quick "is the next call allowed?" boolean
if pg.is_within_limit("user_123", model="gpt-4o"):
    response = openai.chat.completions.create(...)

How It Works

Paygent adds negligible overhead per LLM call — typically single-digit milliseconds. Guard checks are in-memory operations held briefly under a per-user lock. Events are pushed to a non-blocking queue and flushed by a background thread. The call path is:

  1. Read context — which user is this call for?
  2. Guard check + reserve — held under a per-user lock; pre-call reservation prevents concurrent bursts from overrunning a cap.
  3. Execute the LLM call — lock released; network I/O runs in parallel with other calls for the same user.
  4. Meter + finalize — extract tokens from the response, update the cache (replacing the reservation with actual cost), push to the background event queue.

For the full architecture (event queue, SQLite schema, reservation semantics), see CONTRIBUTING.md.

Local Mode

Paygent supports two distinct offline scenarios, and the SDK behaves differently in each.

Local-only mode (no backend at all)

Omit the API key to run without any backend. Everything works the same in the agent's hot path — guardrails, metering, per-model tracking — but events are stored in a local SQLite database and stay there. There's no backend to sync to.

pg = Paygent.init()  # No api_key = local-only
print(pg.is_local_mode)  # True

# Plans must be supplied in code since there's no backend to fetch from.
pg.start_session("user_123", plan="free", plan_config=PlanConfig(
    max_spend_per_period=5.00,
    cost_rates={"gpt-4o": ModelCostRate(input=0.0025, output=0.01)},
))

Good for tests, local development, demos.

Connected mode with offline fallback

When you pass api_key=... but the backend is transiently unreachable, Paygent degrades gracefully:

  • Guard checks continue running against the last-known cached state.
  • New events queue in the local SQLite database marked unsynced.
  • A background thread retries the sync on every sync_pending cycle (default every 30s).
  • When the backend returns, queued events flush to it automatically.

You don't need to do anything for this — it's automatic. Events are never lost due to transient backend outages.

The local database lives at ~/.paygent/local.db by default. Override via Paygent.init(db_path=...).
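One sync_pending cycle amounts to: attempt each unsynced event and keep whatever fails for the next cycle. An illustrative sketch (the real SDK persists the queue in SQLite rather than in memory):

```python
def sync_pending(unsynced_events, send_to_backend):
    """One sync cycle: attempt each unsynced event; keep failures for retry."""
    remaining = []
    for event in unsynced_events:
        try:
            send_to_backend(event)      # e.g. an HTTP POST to the backend
        except ConnectionError:
            remaining.append(event)     # backend unreachable: retry next cycle
    return remaining

def backend_down(event):
    raise ConnectionError("unreachable")

pending = sync_pending(["e1", "e2"], backend_down)
print(pending)                          # ['e1', 'e2'] -- still queued
pending = sync_pending(pending, lambda event: None)
print(pending)                          # [] -- drained once the backend returns
```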

API Reference

Paygent

| Method | Description |
| --- | --- |
| Paygent.init(api_key=None, ...) | Initialize the SDK (singleton) |
| pg.start_session(user_id, plan, plan_config) | Optional — pre-warm a user's session (SDK auto-loads on first use) |
| pg.get_usage(user_id) | Get current usage snapshot (auto-loads) |
| pg.get_model_usage(user_id) | Get per-model breakdown |
| pg.get_remaining_budget(user_id) | Multi-dimensional remaining budget (spend + per-model tokens) |
| pg.get_max_tokens(user_id, model, ...) | Recommend a safe max_tokens value for the next call |
| pg.is_within_limit(user_id, model=None) | Quick boolean: is the next call allowed? |
| pg.check_guard(user_id, model) | Manual pre-flight guard check (returns GuardResult) |
| pg.on_soft_gate(callback) | Register soft gate handler |
| pg.on_hard_gate(callback) | Register hard gate handler |
| pg.on_usage(callback) | Register post-metering handler |
| pg.on_session_start(callback) | Register session start handler |
| pg.track(user_id=...) | Decorator for user context |
| pg.wrap(call, user_id, model) | Explicit metering wrapper (sync) |
| pg.awrap(coro, user_id, model) | Explicit metering wrapper (async) |
| pg.backend_reachable | Property: True if the init-time backend probe succeeded |
| pg.flush() | Manually flush pending events |
| pg.shutdown() | Graceful shutdown |

Context Managers

| Function | Description |
| --- | --- |
| paygent_context(user_id, ...) | Set user context for a block |
| paygent_track(user_id, ...) | Decorator variant |

Models

| Model | Description |
| --- | --- |
| PlanConfig | Plan limits, cost rates, model limits |
| ModelCostRate | Per-1K-token cost for a model (input + output) |
| ModelLimitConfig | Per-model token cap within a plan |
| GuardResult | Result of a guard check (ok/soft_gate/hard_gate) |
| UsageEvent | A single metered event |
| CurrentUsage | Live usage counters |
| ModelUsage | Per-model tokens/cost snapshot |
| BudgetRemaining | Remaining spend and per-model tokens (returned by get_remaining_budget) |
| MaxTokensAdvice | Safe max_tokens recommendation (returned by get_max_tokens) |
| UserState | Full cached state for a user (plan + usage + billing period) |
| BillingPeriod | Subscription-anchored billing window |
| UserSession | Deprecated alias for UserState (kept for backward compat) |

Known Limitations

Multi-process / multi-replica deployments

Paygent keeps per-user usage in an in-memory cache per process and syncs events to the backend on a background timer. Guard checks (soft gate, hard gate, model limits) run against the local cache only — they do not round-trip to the backend on every LLM call.

When you run multiple worker processes (Gunicorn with workers > 1, multi-replica Kubernetes, multiple containers, etc.), each process has its own independent cache. The caches converge by periodic refresh from the backend (refresh_interval, default 60 seconds), but between refreshes they drift.

Practical impact: a user making concurrent requests that land on different workers can briefly exceed their configured limit. Maximum possible overspend per refresh window is roughly:

(workers - 1) × refresh_interval × request_rate × avg_cost_per_request

Example: 4 Gunicorn workers, 1 LLM req/sec, $0.01/req, 60s refresh → up to ~$1.80 overspend per user per minute in the worst case.
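Evaluating the formula for that example confirms the figure:

```python
workers = 4
refresh_interval_s = 60
request_rate_per_s = 1.0      # LLM requests per second per user
avg_cost_per_request = 0.01   # dollars

# Workers other than the one serving a request don't see its spend until refresh.
max_overspend = ((workers - 1) * refresh_interval_s
                 * request_rate_per_s * avg_cost_per_request)
print(f"${max_overspend:.2f}")  # → $1.80 per user per refresh window
```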

Mitigations (pick based on your needs):

  1. Single worker: run with --workers 1 if strict per-user enforcement is required and throughput is acceptable.
  2. Tighter refresh: pass Paygent.init(refresh_interval=10.0) — reduces drift by 6× at the cost of 6× more backend traffic.
  3. Generous plan buffer: configure hard gates with a safety margin (e.g. set hard gate at 90% of what you actually want to enforce) until shared-cache support lands.

Planned for Phase 2: shared-cache mode (Redis or lease-based budget) that removes this drift entirely while preserving the sub-millisecond guard-check latency of the local cache.

Contributing

See CONTRIBUTING.md for development setup, architecture details, testing, and release process.

License

MIT
