Track LLM usage and compute live gross margin with Tollgate.

Tollgate

tollgateai

Real-time gross-margin observability for AI-powered products.
Track every LLM call's cost, attribute it to a customer, and know whether you're making money — before the invoice goes out.

Dashboard · TypeScript SDK · Quick Start · API Reference


Why Tollgate?

AI products bill customers on plans (per ticket, per seat, usage-based) but pay providers per token. Tollgate joins the two in real time — giving you per-customer, per-agent, per-run gross margin the moment each LLM call completes.

  • 2-line integration — wrap your provider client once; every call is tracked automatically.
  • Zero dependencies — uses only urllib and threading from the Python standard library.
  • Non-blocking — usage reporting fires on a daemon thread. Failures never raise into your application code.
  • Privacy-first — no prompt content is ever transmitted. Only token counts, model identifiers, and metadata.
  • Universal coverage — Anthropic, OpenAI, Google Gemini, AWS Bedrock, and every OpenAI-compatible gateway.
┌──────────────┐    ┌───────────────┐    ┌────────────────┐
│  Your App    │───>│ LLM Provider  │───>│   Provider     │
│  (SDK wrap)  │<───│ (Anthropic,   │<───│   Response     │
│              │    │  OpenAI, …)   │    │  (tokens, id)  │
└──────┬───────┘    └───────────────┘    └────────────────┘
       │
       │  POST /api/track (background daemon thread)
       v
┌─────────────────────────────────────────────────────┐
│  Tollgate Server                                    │
│                                                     │
│  ┌─────────────┐ ┌───────────┐ ┌─────────────────┐  │
│  │ Rate Card   │ │ Plan      │ │ Margin Rollups  │  │
│  │ (1,500+     │ │ Revenue   │ │ (per customer,  │  │
│  │  models)    │ │ Config    │ │  agent, run)    │  │
│  └─────────────┘ └───────────┘ └─────────────────┘  │
└─────────────────────────────────────────────────────┘

Installation

pip install tollgateai

Requirements: Python 3.8+ · Zero dependencies · Standard library only (urllib, threading)


Quick Start

from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()          # reads TOLLGATE_API_KEY from env
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

# Every call is tracked automatically — tokens, cost, latency, tool calls.
msg = anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Resolve this billing dispute..."}],
)

# Close the run and book revenue.
tollgate.resolve(
    run_id="ticket_8842",
    customer_id="cust_acme",
    outcome="resolved",
    revenue_unit_cents=50,       # $0.50 per resolved ticket
)

Provider Support

| Provider | Wrapper | Streaming | Extracted Fields |
| --- | --- | --- | --- |
| Anthropic | wrap_anthropic | Automatic | Input/output tokens, cache read/write, web search requests, tool calls, latency |
| OpenAI | wrap_openai | stream_options={"include_usage": True} | Input/output tokens, reasoning, cached, audio in/out, text in/out, prediction tokens, service tier, tool calls, latency |
| Google Gemini | wrap_gemini | Automatic | Input/output tokens, thinking, cached, audio/image/video per-modality, web search (grounding), tool calls, latency |
| OpenAI-compatible | wrap_openai + provider="openai_compatible" | Same as OpenAI | Same as OpenAI + gateway-reported cost (when available) |
| AWS Bedrock | wrap_bedrock | Automatic | Input/output tokens, cache read/write (per-TTL split), tool calls, latency |

Configuration

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| TOLLGATE_API_KEY | Yes |  | Your account API key (tg_live_…) |
| TOLLGATE_BASE_URL | No | https://www.tollgateai.dev | Self-hosted deployment URL |

Programmatic Configuration

tollgate = create_tollgate_client(
    api_key="tg_live_xxx",
    base_url="https://www.tollgateai.dev",
    timeout=10.0,       # per-request timeout in seconds (default 10)
    max_retries=2,      # retries on 5xx/429/network (default 2)
)

Auto-Instrumentation

Wrap your provider client once. Every create / generate_content / converse call reports usage on a background daemon thread — non-blocking, fire-and-forget. Failures go to on_error (default: logger.warning) and never raise into your application code.
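
The fire-and-forget pattern described above can be sketched with the standard library alone. This is an illustration, not the SDK's source: `report_in_background`, `tracked_call`, and the `send` callable are hypothetical names, and the real wrappers extract far more than latency.

```python
import logging
import threading
import time

logger = logging.getLogger("tollgate")


def report_in_background(send, event, on_error=None):
    """Run send(event) on a daemon thread and route failures to on_error."""
    handler = on_error or (lambda err: logger.warning("tracking failed: %s", err))

    def worker():
        try:
            send(event)
        except Exception as err:  # tracking errors must never reach the caller
            handler(err)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # returned only so callers or tests can join() if they wish


def tracked_call(provider_call, send, on_error=None, **kwargs):
    """Call the provider, time it, and report usage without blocking."""
    started = time.monotonic()
    response = provider_call(**kwargs)  # request and response pass through untouched
    latency_ms = int((time.monotonic() - started) * 1000)
    thread = report_in_background(send, {"latencyMs": latency_ms}, on_error)
    return response, thread
```

Because the worker is a daemon thread, it does not keep the interpreter alive at shutdown, which is the usual trade-off of this design: a report that is still in flight when the process exits may be lost.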

Anthropic

from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this ticket..."}],
)

OpenAI

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
openai = wrap_openai(OpenAI(), tollgate, customer_id="cust_acme")

openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

Google Gemini

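import os

import google.generativeai as genai
from tollgate import create_tollgate_client, wrap_gemini

# Assumes the Gemini key is exported as GEMINI_API_KEY; adjust to your setup.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])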
tollgate = create_tollgate_client()
model = wrap_gemini(
    genai.GenerativeModel("gemini-2.0-flash"),
    tollgate,
    customer_id="cust_acme",
)

response = model.generate_content("Explain quantum computing")

OpenAI-Compatible Gateways

Works with any OpenAI-compatible endpoint — OpenRouter, Groq, Together, Nebius, Vercel AI Gateway, local vLLM, and more.

import os

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
groq = wrap_openai(
    # Assumes the Groq key is exported as GROQ_API_KEY; adjust to your setup.
    OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1"),
    tollgate,
    customer_id="cust_acme",
    provider="openai_compatible",
)

groq.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)

When a gateway returns cost inline (e.g. OpenRouter's usage.cost), the SDK captures it automatically as providerCostCents. The server uses it verbatim, bypassing the rate card. Gateways that don't return cost fall through to rate-card pricing. An explicit provider_cost_cents in the wrapper options always takes precedence.
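
The precedence described above (explicit override first, then gateway-reported cost, then the rate card) can be written as a small decision function. This is a sketch of the rule only; `pick_cost_source` is a hypothetical helper, not part of the SDK.

```python
from typing import Optional, Tuple


def pick_cost_source(explicit_cents: Optional[float],
                     gateway_cents: Optional[float]) -> Tuple[str, Optional[float]]:
    """Return which cost source wins and the cost it supplies, if any."""
    if explicit_cents is not None:
        return ("explicit", explicit_cents)   # provider_cost_cents from wrapper options
    if gateway_cents is not None:
        return ("gateway", gateway_cents)     # e.g. a cost reported inline by the gateway
    return ("rate_card", None)                # server prices the tokens from the rate card
```

Note that the checks compare against `None` rather than truthiness, so an explicit cost of 0 still takes precedence.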

AWS Bedrock

import boto3
from tollgate import create_tollgate_client, wrap_bedrock

tollgate = create_tollgate_client()
bedrock = wrap_bedrock(
    boto3.client("bedrock-runtime", region_name="us-east-1"),
    tollgate,
    customer_id="cust_acme",
)

bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)

Streaming

Streaming is captured automatically. Iterate the stream as usual — usage and latency are reported when the stream ends.

OpenAI / compatible requires stream_options={"include_usage": True}. Anthropic, Gemini, and Bedrock need no extra flags.

stream = openai.chat.completions.create(
    model="gpt-4o",
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Hello"}],
)
for chunk in stream:
    pass  # render to UI
# Usage + latency reported automatically when stream ends.
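
To see why the include_usage flag matters, here is a rough sketch of how a wrapper could pass chunks through unchanged and report once at the end. With include_usage enabled, OpenAI sends a final chunk whose usage field is populated; earlier chunks carry no usage. `iterate_with_usage` and the fake chunks are hypothetical stand-ins, not the SDK's implementation.

```python
from types import SimpleNamespace


def iterate_with_usage(stream, on_usage):
    """Yield chunks unchanged; call on_usage once when the stream ends."""
    usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage  # with include_usage, the last chunk carries totals
        yield chunk
    if usage is not None:
        on_usage(usage)


# Fake chunks shaped like a streamed chat completion (illustrative values).
fake_stream = [
    SimpleNamespace(choices=["Hel"], usage=None),
    SimpleNamespace(choices=["lo"], usage=None),
    SimpleNamespace(choices=[], usage=SimpleNamespace(prompt_tokens=9, completion_tokens=2)),
]
reported = []
chunks = list(iterate_with_usage(fake_stream, reported.append))
```

Without the flag no chunk carries usage, so in this sketch `on_usage` would never fire.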

Tracked Fields

Every auto-instrumented call captures these fields from the provider response:

| Field | Providers | Description |
| --- | --- | --- |
| tokensIn | All | Input tokens (deduplicated: excludes cached/audio for OpenAI; excludes cached/audio/image/video for Gemini) |
| tokensOut | All | Output tokens (deduplicated: excludes reasoning/audio for OpenAI; excludes audio/image for Gemini) |
| reasoningTokens | OpenAI, Gemini | Reasoning/thinking tokens (billed at reasoning rate) |
| cachedTokens | All | Prompt cache read tokens (reduced rate) |
| cacheWrite5mTokens | Anthropic, Bedrock | Cache creation tokens (5-minute TTL) |
| cacheWrite1hTokens | Bedrock | Cache creation tokens (1-hour TTL) |
| audioTokensIn / Out | OpenAI, Gemini | Audio modality tokens (GPT-4o audio, Gemini multimodal) |
| imageTokensIn / Out | Gemini | Image/vision input and generation output tokens |
| videoTokensIn | Gemini | Video input tokens |
| textTokensIn / Out | OpenAI, Gemini | Text-only modality tokens |
| webSearchRequests | Anthropic, Gemini | Web search requests (server tools / grounding) |
| acceptedPredictionTokens | OpenAI | Predicted Outputs: accepted tokens |
| rejectedPredictionTokens | OpenAI | Predicted Outputs: rejected (waste) tokens |
| serviceTier | OpenAI | Service tier (default, flex, priority) |
| latencyMs | All | SDK-measured request duration in milliseconds |
| toolCalls | All | Number of tool calls in the response |
| providerCostCents | OpenAI-compatible | Gateway-reported cost; used verbatim, bypasses rate card |
| model | All | Model identifier as reported by the provider |

Cost is computed server-side from token counts and a rate card that auto-syncs daily from the LiteLLM registry (1,500+ models). Rate cards include per-token pricing for every modality, cache tier, reasoning, and web search. Unknown models are priced at $0 and flagged in logs.
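
As a rough illustration of rate-card pricing, the sketch below multiplies each token field by a per-million rate and prices unknown models at $0 with a flag. The model name, the rates, and the `cost_cents` helper are all made up for the example; the real rate card lives on the server and covers more fields.

```python
# Hypothetical rate card: cents per 1M tokens, keyed by tracked field name.
RATE_CARD = {
    "example-model": {"tokensIn": 300.0, "tokensOut": 1500.0, "cachedTokens": 30.0},
}


def cost_cents(event, rate_card=RATE_CARD):
    """Return (cost in cents, unknown_model flag) for one usage event."""
    rates = rate_card.get(event["model"])
    if rates is None:
        return 0.0, True  # unknown model: priced at $0 and flagged
    total = sum(event.get(field, 0) * rate / 1_000_000 for field, rate in rates.items())
    return total, False


known = cost_cents({"model": "example-model", "tokensIn": 1_000_000, "tokensOut": 200_000})
unknown = cost_cents({"model": "not-in-card", "tokensIn": 500})
```

Here `known` works out to 600 cents (300 for input plus 300 for output), while `unknown` is priced at zero with the flag set.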


Provider Field Coverage

Anthropic — Messages API
| Anthropic API Field | SDK Field | Notes |
| --- | --- | --- |
| usage.input_tokens | tokensIn | Input tokens (excludes cached) |
| usage.output_tokens | tokensOut | Output tokens (includes reasoning, billed at output rate) |
| usage.cache_read_input_tokens | cachedTokens | Prompt cache read tokens |
| usage.cache_creation_input_tokens | cacheWrite5mTokens | Prompt cache creation tokens |
| usage.server_tool_use.web_search_requests | webSearchRequests | Web search server tool requests |
| response.content[] (type tool_use) | toolCalls | Count of tool-use content blocks |
| (SDK-measured) | latencyMs | Request duration |

Anthropic bills reasoning tokens at the output rate. The SDK reports the full output_tokens count; the server-side rate card applies the matching output rate.

In streaming mode, message_start carries input/cache counts and message_delta carries the output count. The SDK accumulates both automatically.
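
A minimal sketch of that accumulation, using plain dicts shaped like Anthropic's streaming events. `accumulate_anthropic_usage` is a hypothetical helper; the SDK's own accumulator may differ.

```python
def accumulate_anthropic_usage(events):
    """Fold message_start and message_delta events into one usage record."""
    usage = {"tokensIn": 0, "tokensOut": 0, "cachedTokens": 0, "cacheWrite5mTokens": 0}
    for event in events:
        kind = event.get("type")
        if kind == "message_start":
            start = event["message"]["usage"]
            usage["tokensIn"] = start.get("input_tokens", 0) or 0
            usage["cachedTokens"] = start.get("cache_read_input_tokens", 0) or 0
            usage["cacheWrite5mTokens"] = start.get("cache_creation_input_tokens", 0) or 0
        elif kind == "message_delta":
            # output_tokens on message_delta is cumulative, so assign rather than add
            usage["tokensOut"] = event["usage"].get("output_tokens", 0) or 0
    return usage


events = [
    {"type": "message_start", "message": {"usage": {
        "input_tokens": 25, "output_tokens": 1,
        "cache_read_input_tokens": 100, "cache_creation_input_tokens": 0}}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hi"}},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 15}},
]
totals = accumulate_anthropic_usage(events)
```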

OpenAI — Chat Completions API
| OpenAI API Field | SDK Field | Notes |
| --- | --- | --- |
| usage.prompt_tokens | tokensIn | Minus cached and audio tokens to prevent double-billing |
| usage.completion_tokens | tokensOut | Minus reasoning and audio tokens to prevent double-billing |
| usage.completion_tokens_details.reasoning_tokens | reasoningTokens | Reasoning/thinking tokens |
| usage.prompt_tokens_details.cached_tokens | cachedTokens | Prompt cache read tokens |
| usage.prompt_tokens_details.audio_tokens | audioTokensIn | Audio input tokens |
| usage.completion_tokens_details.audio_tokens | audioTokensOut | Audio output tokens |
| usage.prompt_tokens_details.text_tokens | textTokensIn | Text modality input tokens |
| usage.completion_tokens_details.text_tokens | textTokensOut | Text modality output tokens |
| usage.completion_tokens_details.accepted_prediction_tokens | acceptedPredictionTokens | Predicted Outputs: accepted |
| usage.completion_tokens_details.rejected_prediction_tokens | rejectedPredictionTokens | Predicted Outputs: rejected |
| service_tier | serviceTier | Service tier used |
| choices[].message.tool_calls | toolCalls | Tool call count |
| (SDK-measured) | latencyMs | Request duration |

OpenAI's prompt_tokens and completion_tokens are totals that include sub-category tokens. The SDK subtracts each sub-category so every token is costed at exactly one rate.
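
The subtraction can be checked with a few lines of arithmetic. The sketch below works on a usage dict shaped like the Chat Completions response; `split_openai_usage` is a hypothetical helper covering only the four sub-categories named in the table.

```python
def split_openai_usage(usage):
    """Split OpenAI totals so each token lands in exactly one bucket."""
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}
    cached = prompt_details.get("cached_tokens", 0) or 0
    audio_in = prompt_details.get("audio_tokens", 0) or 0
    reasoning = completion_details.get("reasoning_tokens", 0) or 0
    audio_out = completion_details.get("audio_tokens", 0) or 0
    return {
        "tokensIn": max(usage["prompt_tokens"] - cached - audio_in, 0),
        "tokensOut": max(usage["completion_tokens"] - reasoning - audio_out, 0),
        "cachedTokens": cached,
        "audioTokensIn": audio_in,
        "reasoningTokens": reasoning,
        "audioTokensOut": audio_out,
    }


split = split_openai_usage({
    "prompt_tokens": 1000,
    "completion_tokens": 300,
    "prompt_tokens_details": {"cached_tokens": 400, "audio_tokens": 100},
    "completion_tokens_details": {"reasoning_tokens": 120, "audio_tokens": 30},
})
```

With these illustrative numbers, `tokensIn` is 500 and `tokensOut` is 150, and the input buckets still sum to the original 1000 prompt tokens.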

Google Gemini — Generative AI / Vertex AI
| Google API Field | SDK Field | Notes |
| --- | --- | --- |
| usageMetadata.promptTokenCount | tokensIn | Minus cached, audio, image, video to prevent double-billing |
| usageMetadata.candidatesTokenCount | tokensOut | Minus audio and image output (thinking is already excluded by Google) |
| usageMetadata.thoughtsTokenCount | reasoningTokens | Thinking/reasoning tokens (Gemini 2.x) |
| usageMetadata.cachedContentTokenCount | cachedTokens | Prompt cache read tokens |
| promptTokensDetails[AUDIO] | audioTokensIn | Audio input modality |
| candidatesTokensDetails[AUDIO] | audioTokensOut | Audio output modality |
| promptTokensDetails[IMAGE] | imageTokensIn | Image/vision input |
| candidatesTokensDetails[IMAGE] | imageTokensOut | Image generation output |
| promptTokensDetails[VIDEO] | videoTokensIn | Video input |
| promptTokensDetails[TEXT] | textTokensIn | Text input |
| candidatesTokensDetails[TEXT] | textTokensOut | Text output |
| candidates[].groundingMetadata.webSearchQueries | webSearchRequests | Google Search grounding |
| candidates[].content.parts[].functionCall | toolCalls | Function call count |
| (SDK-measured) | latencyMs | Request duration |

Google's candidatesTokenCount does not include thoughtsTokenCount, so reasoning tokens are not subtracted. However, it does include audio and image output tokens, so the SDK subtracts those to prevent double-billing.

The Python SDK handles both snake_case (usage_metadata, prompt_token_count) and camelCase (usageMetadata, promptTokenCount) response formats — compatible with both the official google-generativeai SDK and the REST API.
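
One way to tolerate both naming styles is a lookup helper that tries the snake_case name first and the camelCase name second, on either an object or a dict. The sketch below is an assumption about how such handling could look; `read_field` and `gemini_usage_counts` are hypothetical names, and the per-modality subtraction is left out to keep it short.

```python
from types import SimpleNamespace


def read_field(obj, snake, camel, default=0):
    """Read a value by snake_case or camelCase name from an object or dict."""
    for name in (snake, camel):
        value = obj.get(name) if isinstance(obj, dict) else getattr(obj, name, None)
        if value is not None:
            return value
    return default


def gemini_usage_counts(response):
    """Return the raw usage counters, or None when no usage metadata is present."""
    meta = read_field(response, "usage_metadata", "usageMetadata", default=None)
    if meta is None:
        return None
    return {
        "promptTokens": read_field(meta, "prompt_token_count", "promptTokenCount"),
        "candidatesTokens": read_field(meta, "candidates_token_count", "candidatesTokenCount"),
        "reasoningTokens": read_field(meta, "thoughts_token_count", "thoughtsTokenCount"),
        "cachedTokens": read_field(meta, "cached_content_token_count", "cachedContentTokenCount"),
    }


# The same counts, once as an SDK-style object and once as a REST-style dict.
sdk_style = SimpleNamespace(usage_metadata=SimpleNamespace(
    prompt_token_count=12, candidates_token_count=34,
    thoughts_token_count=5, cached_content_token_count=0))
rest_style = {"usageMetadata": {
    "promptTokenCount": 12, "candidatesTokenCount": 34, "thoughtsTokenCount": 5}}
```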

AWS Bedrock — Converse API
| Bedrock API Field | SDK Field | Notes |
| --- | --- | --- |
| usage.inputTokens | tokensIn | Input tokens |
| usage.outputTokens | tokensOut | Output tokens (includes reasoning; Bedrock does not split) |
| usage.cacheReadInputTokens | cachedTokens | Prompt cache read tokens |
| usage.cacheDetails[ttl="5m"] | cacheWrite5mTokens | Cache creation (5-minute TTL) |
| usage.cacheDetails[ttl="1h"] | cacheWrite1hTokens | Cache creation (1-hour TTL, higher rate) |
| output.message.content[].toolUse | toolCalls | Tool-use content block count |
| (SDK-measured) | latencyMs | Request duration |

Bedrock's cacheDetails array provides per-TTL breakdowns. The SDK splits these into cacheWrite5mTokens and cacheWrite1hTokens. When cacheDetails is absent, cacheWriteInputTokens falls back to the 5m bucket.
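
The split and its fallback might look roughly like this. The sketch assumes each cacheDetails entry is a dict with a ttl key and an inputTokens count; that entry shape is an assumption for illustration, and `split_cache_writes` is a hypothetical helper.

```python
def split_cache_writes(usage):
    """Split Bedrock cache-write tokens into 5-minute and 1-hour buckets."""
    details = usage.get("cacheDetails")
    if not details:
        # No per-TTL breakdown: the aggregate count goes to the 5m bucket.
        return {
            "cacheWrite5mTokens": usage.get("cacheWriteInputTokens", 0) or 0,
            "cacheWrite1hTokens": 0,
        }
    buckets = {"5m": 0, "1h": 0}
    for entry in details:
        ttl = entry.get("ttl")  # assumed entry shape: {"ttl": ..., "inputTokens": ...}
        if ttl in buckets:
            buckets[ttl] += entry.get("inputTokens", 0) or 0
    return {"cacheWrite5mTokens": buckets["5m"], "cacheWrite1hTokens": buckets["1h"]}
```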

In streaming mode (converse_stream), the final metadata event carries usage totals. Tool calls are accumulated from contentBlockStart events.
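
A simplified view of that streaming bookkeeping, written against dicts shaped like converse_stream events. `summarize_converse_stream` is a hypothetical helper and ignores cache fields for brevity.

```python
def summarize_converse_stream(events):
    """Count tool-use blocks and pick usage totals from the metadata event."""
    tool_calls = 0
    usage = {}
    for event in events:
        start = event.get("contentBlockStart", {}).get("start", {})
        if "toolUse" in start:
            tool_calls += 1  # one contentBlockStart per tool-use block
        if "metadata" in event:
            usage = event["metadata"].get("usage", {})  # final totals for the stream
    return {
        "tokensIn": usage.get("inputTokens", 0),
        "tokensOut": usage.get("outputTokens", 0),
        "toolCalls": tool_calls,
    }


stream_events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockStart": {"contentBlockIndex": 0,
                           "start": {"toolUse": {"toolUseId": "t1", "name": "lookup"}}}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"messageStop": {"stopReason": "tool_use"}},
    {"metadata": {"usage": {"inputTokens": 40, "outputTokens": 12, "totalTokens": 52}}},
]
summary = summarize_converse_stream(stream_events)
```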


Pricing Models

Per-Call Revenue

For simple per-call billing, pass revenue_unit_cents in the wrapper options:

anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    revenue_unit_cents=50,  # $0.50 earned per LLM call
)

Outcome-Based Pricing

Under per-resolution pricing, only a resolved run earns revenue. Escalated or failed runs earn $0, but provider costs still count against margin.

run_id = "ticket_8842"
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id=run_id,
)

# ... multiple LLM calls within this run ...

tollgate.resolve(
    run_id=run_id,
    customer_id="cust_acme",
    outcome="resolved",        # "resolved" | "escalated" | "failed"
    revenue_unit_cents=50,
)
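
To make the margin effect concrete, here is a small worked example. The 12.5-cent provider cost is an illustrative figure and `run_margin` is a hypothetical helper; the server's rollups are the source of truth.

```python
def run_margin(outcome, revenue_unit_cents, cost_cents):
    """Return (revenue, margin in cents, margin percent or None) for one run."""
    revenue = revenue_unit_cents if outcome == "resolved" else 0
    margin_cents = revenue - cost_cents
    # Percent margin is undefined when the run earned nothing.
    margin_pct = (margin_cents / revenue * 100) if revenue else None
    return revenue, margin_cents, margin_pct


resolved = run_margin("resolved", 50, 12.5)    # earns $0.50, keeps 37.5 cents (75%)
escalated = run_margin("escalated", 50, 12.5)  # earns $0, loses 12.5 cents
```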

External Tool Costs

Report costs from non-LLM services (image generation, code sandboxes, search APIs) alongside LLM calls:

tollgate.track({
    "customerId": "cust_acme",
    "runId": "ticket_8842",
    "provider": "openai",
    "model": "dall-e-3",
    "tokensIn": 0,
    "tokensOut": 0,
    "externalCostCents": 4.0,     # $0.04 for the DALL-E call
    "idempotencyKey": "ticket_8842#dalle",
})

Customer & Plan Setup

Create customers and assign plans before sending usage so plan-priced revenue is recognized from the first event. The call is idempotent, so it is safe to run on every app boot.

tollgate.upsert_customer(
    "cust_acme",
    name="Acme Corp",
    plan={
        "name": "Pro Plan",
        "pricingModel": "usage_based",   # per_unit | per_resolution | usage_based | per_seat | flat | hybrid
        "unitRevenueCents": 10,
    },
)

Error Handling

The SDK separates tracking errors (non-fatal) from client errors (actionable):

import logging

import sentry_sdk  # any error reporter works; Sentry is only an example
from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic, TollgateError

tollgate = create_tollgate_client()

# Tracking errors are logged as warnings by default.
# Override with on_error to route to your observability stack:
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    on_error=lambda err: sentry_sdk.capture_exception(err),
)

# Client errors (missing API key, invalid plan) raise TollgateError:
try:
    tollgate.upsert_customer("cust_acme")
except TollgateError as err:
    print(err.status, err.body)  # HTTP status + response body

Retry behavior: The client retries 5xx and 429 responses and network errors with exponential backoff (200ms, 400ms, ...), up to max_retries attempts (default 2). Deterministic 4xx errors (400, 401, 403, 404, 422) fail immediately.
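
The retry policy can be summarized in two small functions. This is a sketch of the stated rules, not the client's code: `backoff_delays` and `should_retry` are hypothetical names, and using None to stand for a network error is an assumption of the example.

```python
def backoff_delays(max_retries=2, base_seconds=0.2):
    """Delays before each retry: 0.2s, 0.4s, 0.8s, ... (doubling each time)."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


def should_retry(status):
    """Retry on network errors (status None here), 429, and any 5xx."""
    if status is None:
        return True
    return status == 429 or 500 <= status <= 599
```

With the default of two retries, a request that keeps failing is attempted three times in total, waiting 200 ms and then 400 ms between attempts.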

Logging: The SDK uses the standard logging module under the "tollgate" logger name. Configure it as you would any Python logger:

logging.getLogger("tollgate").setLevel(logging.DEBUG)

API Reference

Exports

# Client
create_tollgate_client(api_key?, base_url?, timeout?, max_retries?)  # -> TollgateClient
TollgateError                    # Exception with status & body

# Auto-instrumentation wrappers
wrap_anthropic(client, tollgate, customer_id, **kwargs)   # -> instrumented Anthropic client
wrap_openai(client, tollgate, customer_id, **kwargs)      # -> instrumented OpenAI / compatible client
wrap_bedrock(client, tollgate, customer_id, **kwargs)     # -> instrumented Bedrock client
wrap_gemini(model, tollgate, customer_id, **kwargs)       # -> instrumented Gemini model

# Low-level event builders (for manual track payloads)
anthropic_event_from(msg, customer_id, **kwargs)          # -> dict | None
openai_event_from(completion, customer_id, **kwargs)      # -> dict | None
bedrock_event_from(usage, model, customer_id, **kwargs)   # -> dict | None
gemini_event_from(response, customer_id, **kwargs)        # -> dict | None

TollgateClient

| Method | Description |
| --- | --- |
| track(event: dict) | Report a single usage event. Idempotent on idempotencyKey. Returns {"status", "eventId"}. |
| resolve(run_id, customer_id, outcome, ...) | Close a run with an outcome. Books revenue only when outcome == "resolved". |
| upsert_customer(customer_id, ...) | Create or update a customer and optionally assign a plan. Returns {"status", "customerId", "id", "planId"}. |

create_tollgate_client Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | TOLLGATE_API_KEY env | Account API key |
| base_url | str | https://www.tollgateai.dev | Tollgate server URL |
| timeout | float | 10.0 | Per-request timeout in seconds |
| max_retries | int | 2 | Retry attempts on 5xx / 429 / network errors |

Wrapper Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| customer_id | str | Yes | Your end customer's stable identifier |
| agent_id | str | No | Agent or workflow identifier |
| run_id | str or Callable | No | Logical run ID (defaults to provider response ID) |
| provider | str | No | Override the reported provider |
| revenue_unit_cents | int or Callable | No | Revenue per call in cents |
| provider_cost_cents | float or Callable | No | Exact cost override in cents (skips rate card) |
| on_error | Callable | No | Error handler for background tracking (default: logger.warning) |

How It Works

  1. Proxy wrappers intercept provider calls without modifying the request or response. Your code sees the exact same types and behavior as without the SDK.
  2. After the provider responds, the wrapper extracts token counts by modality, tool calls, service tier, and latency from the response object.
  3. A POST /api/track fires on a background daemon thread with automatic retries on transient failures. Your application code continues immediately.
  4. The server computes cost from tokens via rate cards (per modality, cache tier, reasoning, and web search), joins it with plan-configured revenue, and updates real-time margin rollups.
  5. Events are idempotent — deduplication is based on idempotencyKey (auto-set to the provider response ID).
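
Step 5 is what makes retries safe: the same event can arrive twice without being counted twice. A toy in-memory version of that deduplication is sketched below; `InMemoryIngest` and its status strings are hypothetical and say nothing about how the server stores events.

```python
class InMemoryIngest:
    """Toy ingestion store that keeps the first event seen per idempotency key."""

    def __init__(self):
        self._events = {}

    def track(self, event):
        key = event["idempotencyKey"]
        if key in self._events:
            return {"status": "duplicate", "eventId": key}  # hypothetical status value
        self._events[key] = event
        return {"status": "accepted", "eventId": key}       # hypothetical status value

    def total_tokens_in(self):
        return sum(e.get("tokensIn", 0) for e in self._events.values())


ingest = InMemoryIngest()
first = ingest.track({"idempotencyKey": "msg_123", "tokensIn": 40})
retry = ingest.track({"idempotencyKey": "msg_123", "tokensIn": 40})  # e.g. a retried POST
```

After both calls the store still holds a single event, so the retried POST does not inflate usage.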

Security & Privacy

  • No prompt content is ever transmitted. Only token counts, model identifiers, and metadata.
  • Idempotent ingestion — duplicate events are safely deduplicated server-side.
  • Non-invasive — background tracking never raises into your application code.
  • Transport security — all communication over HTTPS with Bearer token authentication.
  • Thread-safe — all wrappers and the client are safe for concurrent use.

License

MIT — see LICENSE for details.
