Track LLM usage and compute live gross margin with Tollgate.

tollgateai

Real-time gross-margin observability for AI agents. Track every LLM call's cost, attribute it to a customer, and see whether you're making money — before the invoice goes out.

v0.6.0 · PyPI · Dashboard


Why Tollgate

You sell an AI-powered product. Each customer interaction triggers LLM calls that cost you real money — input tokens, output tokens, reasoning tokens, audio tokens, cached tokens, web searches, tool calls. Tollgate captures that cost automatically from provider responses, joins it with the revenue your pricing model defines, and shows you per-customer, per-agent, per-run gross margin in real time.

Installation

pip install tollgateai

Requires Python 3.8+. Zero dependencies — uses only urllib and threading from the standard library.

Quick Start

from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()          # reads TOLLGATE_API_KEY from env
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

# Every call is tracked automatically — tokens, cost, latency, tool calls.
msg = anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Resolve this billing dispute..."}],
)

# Close the run and book revenue.
tollgate.resolve(
    run_id="ticket_8842",
    customer_id="cust_acme",
    outcome="resolved",
    revenue_unit_cents=50,       # $0.50 per resolved ticket
)

Provider Support

| Provider | Wrapper | Streaming | What Gets Extracted |
| --- | --- | --- | --- |
| Anthropic | wrap_anthropic | Automatic | Tokens, thinking/reasoning, cache (read + write by TTL), web search requests, tool calls, latency |
| OpenAI | wrap_openai | stream_options={"include_usage": True} | Tokens, reasoning, cached, audio in/out, text in/out, prediction tokens, service tier, tool calls, latency |
| Google Gemini | wrap_gemini | Automatic | Tokens, thinking, cached, audio/image/video per-modality, web search (grounding), tool calls, latency |
| OpenAI-compatible | wrap_openai + provider="openai_compatible" | Same as OpenAI | Same as OpenAI |
| AWS Bedrock | wrap_bedrock | Automatic | Tokens, cache (read + write), tool calls, latency |

Configuration

| Environment Variable | Required | Default |
| --- | --- | --- |
| TOLLGATE_API_KEY | Yes | |
| TOLLGATE_BASE_URL | No | https://tollgateai.vercel.app |

Or pass them directly:

tollgate = create_tollgate_client(
    api_key="tg_live_xxx",
    base_url="https://tollgateai.vercel.app",
    timeout=10.0,       # per-request timeout in seconds (default 10)
    max_retries=2,      # retries on 5xx/429/network (default 2)
)
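
The precedence between explicit arguments and environment variables is not spelled out here. The following is a minimal sketch, assuming explicit arguments win over the environment and the documented default base URL is the last fallback. `resolve_config` is a hypothetical helper for illustration, not part of the tollgate package.

```python
import os

DEFAULT_BASE_URL = "https://tollgateai.vercel.app"  # documented default


def resolve_config(api_key=None, base_url=None, env=None):
    """Hypothetical helper: explicit argument, then env var, then default."""
    env = os.environ if env is None else env
    key = api_key or env.get("TOLLGATE_API_KEY")
    if not key:
        # The API key is the only required setting.
        raise ValueError("TOLLGATE_API_KEY is required")
    url = base_url or env.get("TOLLGATE_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": key, "base_url": url.rstrip("/")}
```

Passing `env` explicitly makes the lookup easy to exercise in tests without touching the real process environment.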

Auto-Instrumentation

Wrap your provider client once. Every create, generate_content, or converse call then reports usage in the background on a daemon thread, without blocking your code. Failures go to on_error (default: logger.warning) and never break your LLM call.
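
To make the fire-and-forget behavior concrete, here is a rough sketch of the pattern using only the standard library. It is not the package's actual implementation; `report_in_background` and its `send` argument are hypothetical names.

```python
import logging
import threading

logger = logging.getLogger("tollgate_sketch")


def report_in_background(send, event, on_error=None):
    """Run send(event) on a daemon thread; route any failure to on_error."""
    handler = on_error or (lambda exc: logger.warning("tracking failed: %s", exc))

    def worker():
        try:
            send(event)
        except Exception as exc:
            # Never let a tracking failure propagate into the caller.
            handler(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # returned only so callers or tests can join()
```

A daemon thread does not keep the interpreter alive, so a short-lived script may exit before the report is sent unless it joins the thread first.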

Anthropic

from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this ticket..."}],
)

OpenAI

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
openai = wrap_openai(OpenAI(), tollgate, customer_id="cust_acme")

openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

Google Gemini

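import os

import google.generativeai as genai
from tollgate import create_tollgate_client, wrap_gemini

# Reading the key from GEMINI_API_KEY is an assumption; use your own secret source.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])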
tollgate = create_tollgate_client()
model = wrap_gemini(
    genai.GenerativeModel("gemini-2.0-flash"),
    tollgate,
    customer_id="cust_acme",
)

response = model.generate_content("Explain quantum computing")

OpenAI-Compatible Gateways

Point the OpenAI SDK at any compatible endpoint and pass provider="openai_compatible":

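import os

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
groq = wrap_openai(
    # Reading the key from GROQ_API_KEY is an assumption; use your own secret source.
    OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1"),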
    tollgate,
    customer_id="cust_acme",
    provider="openai_compatible",
)

groq.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)

AWS Bedrock

import boto3
from tollgate import create_tollgate_client, wrap_bedrock

tollgate = create_tollgate_client()
bedrock = wrap_bedrock(
    boto3.client("bedrock-runtime", region_name="us-east-1"),
    tollgate,
    customer_id="cust_acme",
)

bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)

Streaming

Streaming is captured automatically. Iterate the stream as usual — usage and latency are reported when the stream ends.

The one exception is OpenAI and OpenAI-compatible clients, which must pass stream_options={"include_usage": True} so that the provider includes usage in the stream. Anthropic, Gemini, and Bedrock need no extra flags.

stream = openai.chat.completions.create(
    model="gpt-4o",
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Hello"}],
)
for chunk in stream:
    pass  # render to UI
# Usage + latency reported automatically when stream ends.
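
The idea of reporting when the stream ends can be sketched as a generator that passes chunks through and reports once the source is exhausted. This is an illustration only: `track_stream` and `report` are hypothetical names, and chunks are modeled as plain dicts rather than SDK objects.

```python
import time


def track_stream(stream, report):
    """Yield chunks unchanged; call report(usage, latency_ms) once the stream ends."""
    started = time.monotonic()
    usage = None
    for chunk in stream:
        # With include_usage, OpenAI sends usage on a final chunk;
        # earlier chunks carry a null usage value.
        chunk_usage = chunk.get("usage")
        if chunk_usage is not None:
            usage = chunk_usage
        yield chunk
    latency_ms = int((time.monotonic() - started) * 1000)
    report(usage, latency_ms)
```

In this sketch, a consumer that stops iterating early never triggers the report, because the code after the loop only runs once the stream is fully consumed.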

What Gets Tracked

Every auto-instrumented call captures these fields from the provider response:

| Field | Providers | Description |
| --- | --- | --- |
| tokensIn | All | Input tokens consumed |
| tokensOut | All | Output tokens generated |
| reasoningTokens | OpenAI, Anthropic, Gemini | Reasoning/thinking tokens (billed at reasoning rate) |
| cachedTokens | All | Prompt cache read tokens (reduced rate) |
| cacheWrite5mTokens | Anthropic, Bedrock | 5-min TTL cache creation tokens |
| cacheWrite1hTokens | Anthropic | 1-hour TTL cache creation tokens |
| audioTokensIn | OpenAI | Audio input tokens (GPT-4o audio / Realtime) |
| audioTokensOut | OpenAI, Gemini | Audio output tokens |
| imageTokensIn | Gemini | Image/vision input tokens |
| imageTokensOut | Gemini | Image generation output tokens |
| videoTokensIn | Gemini | Video input tokens |
| textTokensIn | OpenAI, Gemini | Text-only input tokens (modality split) |
| textTokensOut | OpenAI, Gemini | Text-only output tokens |
| webSearchRequests | Anthropic, Gemini | Web search requests (server tools / grounding) |
| acceptedPredictionTokens | OpenAI | Predicted Outputs: accepted tokens |
| rejectedPredictionTokens | OpenAI | Predicted Outputs: rejected tokens (waste) |
| serviceTier | OpenAI | Service tier used (default, flex, priority) |
| latencyMs | All | SDK-measured request duration in milliseconds |
| toolCalls | All | Number of tool calls in the response |
| model | All | Model identifier as reported by the provider |

Cost is computed server-side from token counts and a rate card that auto-syncs daily from the LiteLLM registry (1,500+ models). Rate cards include per-token pricing for text, audio, image, video, cache, reasoning, and web search. Unknown models are priced at $0 and flagged in logs.
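
As a worked illustration of rate-card pricing, the sketch below multiplies each token count by a per-million-token rate and sums the results. The rates, the model name, and the `cost_cents` function are all made up for the example; the real server-side computation and the LiteLLM-derived rates may differ.

```python
import logging

logger = logging.getLogger("rate_card_sketch")

# Hypothetical rate card: cents per one million tokens, keyed by event field.
RATE_CARD = {
    "example-model": {"tokensIn": 300, "tokensOut": 1500, "cachedTokens": 30},
}


def cost_cents(event, rate_card=RATE_CARD):
    """Return the event cost in cents; unknown models cost 0 and are logged."""
    rates = rate_card.get(event.get("model"))
    if rates is None:
        logger.warning("unknown model %r priced at 0", event.get("model"))
        return 0.0
    # Sum integer products first, then scale once to limit float error.
    total = sum(event.get(field, 0) * rate for field, rate in rates.items())
    return total / 1_000_000
```

With these example rates, 1,000 input tokens, 500 output tokens, and 2,000 cached tokens cost 0.30 + 0.75 + 0.06 = 1.11 cents.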


Outcome-Based Pricing

Under per-resolution pricing, only a resolved run earns revenue. An escalated or failed run earns $0 but its provider cost still counts.

run_id = "ticket_8842"
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id=run_id,
)

# ... multiple LLM calls within this run ...

tollgate.resolve(
    run_id=run_id,
    customer_id="cust_acme",
    outcome="resolved",        # "resolved" | "escalated" | "failed"
    revenue_unit_cents=50,
)

For simple per-call billing, pass revenue_unit_cents in the wrap options and skip resolve().
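
The per-resolution rule can be written down as a small function. This is a sketch of the arithmetic only, with a hypothetical name; the dashboard's own rollups are computed server-side.

```python
def run_margin_cents(outcome, revenue_unit_cents, call_costs_cents):
    """Revenue is booked only for a resolved run; provider cost always counts."""
    revenue = revenue_unit_cents if outcome == "resolved" else 0
    cost = sum(call_costs_cents)
    return revenue - cost
```

For a run with two calls costing 3 and 4 cents at 50 cents per resolution, a resolved outcome yields a margin of 43 cents, while an escalated one yields -7 cents.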


External Tool Costs

Report costs from external services (image generation, code sandboxes, search APIs) alongside LLM calls:

tollgate.track({
    "customerId": "cust_acme",
    "runId": "ticket_8842",
    "provider": "openai",
    "model": "gpt-4o",
    "tokensIn": 500,
    "tokensOut": 200,
    "externalCostCents": 4.0,     # $0.04 for the DALL-E call
    "idempotencyKey": "ticket_8842#step_2",
})
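
Because track is idempotent on idempotencyKey, it helps to derive the key deterministically from the run and step. A minimal sketch follows, using the field names from the payload above. The helper `external_cost_event` is hypothetical, and omitting the token fields is an assumption about what the API accepts.

```python
def external_cost_event(customer_id, run_id, step, cost_cents, provider, model):
    """Build a track() payload for a non-LLM cost with a deterministic key."""
    return {
        "customerId": customer_id,
        "runId": run_id,
        "provider": provider,
        "model": model,
        "externalCostCents": float(cost_cents),
        # Same run and step always yield the same key, so a retried
        # track() call can be deduplicated.
        "idempotencyKey": f"{run_id}#{step}",
    }
```

The resulting dict can be passed to tollgate.track() in place of the hand-written literal.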

Customer & Plan Setup

Create customers and assign plans before sending usage so plan-priced revenue is recognized from the first event. The call is idempotent.

tollgate.upsert_customer(
    "cust_acme",
    name="Acme Corp",
    plan={
        "name": "Pro Plan",
        "pricingModel": "usage_based",   # per_unit | per_resolution | usage_based | per_seat | flat | hybrid
        "unitRevenueCents": 10,
    },
)

API Reference

Exports

# Client
create_tollgate_client(api_key?, base_url?, timeout?, max_retries?)  # -> TollgateClient
TollgateError                    # Exception with status & body

# Auto-instrumentation wrappers
wrap_anthropic(client, tollgate, customer_id, **kwargs)   # -> instrumented Anthropic client
wrap_openai(client, tollgate, customer_id, **kwargs)      # -> instrumented OpenAI / compatible client
wrap_bedrock(client, tollgate, customer_id, **kwargs)     # -> instrumented Bedrock client
wrap_gemini(model, tollgate, customer_id, **kwargs)       # -> instrumented Gemini model

# Low-level event builders (for manual track payloads)
anthropic_event_from(msg, customer_id, **kwargs)          # -> dict | None
openai_event_from(completion, customer_id, **kwargs)      # -> dict | None
bedrock_event_from(usage, model, customer_id, **kwargs)   # -> dict | None
gemini_event_from(response, customer_id, **kwargs)        # -> dict | None

TollgateClient

| Method | Description |
| --- | --- |
| track(event) | Report a single usage event. Idempotent on idempotencyKey. |
| resolve(run_id, customer_id, outcome, ...) | Close a run with an outcome. Books revenue only when outcome is "resolved". |
| upsert_customer(customer_id, ...) | Create or update a customer and optionally assign a plan. |

Wrapper Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| customer_id | str | Yes | Your end customer's stable identifier |
| agent_id | str | No | Agent or workflow identifier |
| run_id | str or Callable | No | Logical run ID (defaults to provider response ID) |
| provider | str | No | Override the reported provider |
| revenue_unit_cents | int or Callable | No | Revenue per call in cents |
| provider_cost_cents | float or Callable | No | Exact cost override (skips rate card) |
| on_error | Callable | No | Error handler for background tracking |
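
Several parameters accept either a fixed value or a callable. The arguments a callable receives are not documented here, so the sketch below assumes it is given the provider response. `resolve_option` is a hypothetical helper, shown only to illustrate the pattern.

```python
def resolve_option(value, response, default=None):
    """Return a fixed value as-is, or call it with the response if callable."""
    if callable(value):
        # Assumption: the callable takes the provider response as its only argument.
        return value(response)
    return default if value is None else value
```

Under that assumption, run_id could be a fixed string, a function of the response, or left unset so that the provider response ID is used.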

How It Works

  1. Proxy wrappers intercept provider calls without modifying the request or response.
  2. After the provider responds, the wrapper extracts token counts (by modality), tool calls, service tier, and latency from the response.
  3. A POST /api/track fires on a background daemon thread with automatic retries on transient failures.
  4. The server computes cost from tokens via rate cards (text, audio, image, video, cache, reasoning, web search), joins it with plan-configured revenue, and updates real-time margin rollups.
  5. Events are idempotent on idempotencyKey (auto-set to the provider response ID).
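
Step 3 mentions automatic retries on transient failures. A rough sketch of such a loop follows, treating 429, 5xx, and network errors as retryable, as the configuration section describes. The exponential backoff schedule and the names `send_with_retries` and `is_retryable` are assumptions, not the package's documented behavior.

```python
import time


def is_retryable(status):
    """429 and any 5xx status are treated as transient."""
    return status == 429 or 500 <= status <= 599


def send_with_retries(send, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds or retries run out.

    send() returns an HTTP status code, or raises OSError on a network
    failure (urllib.error.URLError is a subclass of OSError).
    """
    attempt = 0
    while True:
        try:
            status = send()
        except OSError:
            status = None  # network failure, treated as retryable
        if status is not None and not is_retryable(status):
            return status
        if attempt >= max_retries:
            return status  # give up; None means the last try was a network error
        sleep(base_delay * (2 ** attempt))  # backoff schedule is an assumption
        attempt += 1
```

Because each event carries an idempotencyKey, resending after an ambiguous failure is safe: the server can discard the duplicate.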

Privacy & Security

  • No prompt content is ever sent. Only token counts, model name, and metadata.
  • Events are deduplicated server-side.
  • Background tracking never raises into your application code.
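
Server-side deduplication can be pictured as a lookup keyed on idempotencyKey. The in-memory class below is only an illustration of the idea under that assumption; how the Tollgate server actually stores and expires keys is not described here.

```python
class DedupStore:
    """Toy in-memory store that accepts each idempotencyKey once."""

    def __init__(self):
        self._seen = {}

    def accept(self, event):
        """Return True if the event is new, False if its key was already seen."""
        key = event["idempotencyKey"]
        if key in self._seen:
            return False
        self._seen[key] = event
        return True
```

A retried event with the same key is rejected on the second attempt, so it is counted once.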

What's New in v0.6.0

  • Fix: Anthropic thinking token extraction. output_tokens_details.thinking_tokens is now extracted and costed at the reasoning rate instead of the output rate. Previously, thinking tokens from extended thinking (Sonnet 4.x, Opus 4.x) were invisible to cost computation.
  • Fix: OpenAI output double-counting. completion_tokens includes reasoning and audio sub-totals; these are now subtracted from tokensOut so each token is costed at exactly one rate. Previously, reasoning tokens were billed at both the output rate and the reasoning rate.
  • Fix: OpenAI input double-counting. prompt_tokens includes cached and audio sub-totals; these are now subtracted from tokensIn. Previously, cached tokens were billed at both the full input rate and the cached rate.
  • Fix: Multimodal-only events. Audio, image, video, and web search events now trigger rate-card lookup even when text token counts are zero.
  • reasoningTokens is now extracted from all three providers: OpenAI, Anthropic, and Gemini.
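
The two OpenAI double-counting fixes amount to subtracting sub-totals from the headline counts. Here is a sketch of that arithmetic over a usage payload modeled as a plain dict. The function name is hypothetical and the package's real extraction code may differ. The input keys follow the OpenAI usage object; the output keys follow the field names in What Gets Tracked.

```python
def split_openai_usage(usage):
    """Split OpenAI usage so each token lands in exactly one bucket."""
    prompt = usage.get("prompt_tokens_details") or {}
    completion = usage.get("completion_tokens_details") or {}
    cached = prompt.get("cached_tokens") or 0
    audio_in = prompt.get("audio_tokens") or 0
    reasoning = completion.get("reasoning_tokens") or 0
    audio_out = completion.get("audio_tokens") or 0
    # prompt_tokens and completion_tokens include the sub-totals above,
    # so subtract them to avoid costing the same token at two rates.
    tokens_in = (usage.get("prompt_tokens") or 0) - cached - audio_in
    tokens_out = (usage.get("completion_tokens") or 0) - reasoning - audio_out
    return {
        "tokensIn": max(tokens_in, 0),
        "tokensOut": max(tokens_out, 0),
        "cachedTokens": cached,
        "audioTokensIn": audio_in,
        "reasoningTokens": reasoning,
        "audioTokensOut": audio_out,
    }
```

For example, a response with 1,000 prompt tokens (600 cached) and 800 completion tokens (500 reasoning) yields tokensIn of 400 and tokensOut of 300, with the cached and reasoning tokens costed separately.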

v0.5.0

  • Google Gemini / Vertex AI support (wrap_gemini) with full multimodal extraction
  • Audio token tracking (OpenAI GPT-4o audio / Realtime API)
  • Image & video token tracking (Gemini per-modality breakdowns)
  • Web search request tracking (Anthropic server_tool_use, Gemini grounding)
  • Latency measurement on all wrappers (SDK-measured latencyMs)
  • OpenAI Predicted Outputs (acceptedPredictionTokens / rejectedPredictionTokens)
  • Service tier tracking (OpenAI flex / priority, Anthropic priority)
  • Text modality split for accurate cost attribution in mixed-modal requests
  • Expanded rate card sync: audio, image, video, and web search rates from LiteLLM

License

Licensed for use with Tollgate.
