
Track real LLM usage and compute live gross margin with Tollgate.


tollgateai

Real-time gross-margin observability for AI agents. Track every LLM call's cost, attribute it to a customer, and see whether you're making money — before the invoice goes out.

v0.3.0


Why Tollgate

You sell an AI-powered product. Each customer interaction triggers LLM calls that cost you real money — input tokens, output tokens, reasoning tokens, cached tokens, tool calls. Tollgate captures that cost automatically from provider responses, joins it with the revenue your pricing model defines, and shows you per-customer, per-agent, per-run gross margin in real time.

Installation

pip install tollgateai

Requires Python 3.8+. No third-party dependencies: the client uses only urllib and threading from the standard library. Install the provider SDK you use (anthropic, openai, or boto3) separately.

Quick Start

from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()          # reads TOLLGATE_API_KEY from env
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

# Every call is tracked automatically — tokens, cost, tool calls.
msg = anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Resolve this billing dispute…"}],
)

# Close the run and book revenue.
tollgate.resolve(
    run_id="ticket_8842",
    customer_id="cust_acme",
    outcome="resolved",
    revenue_unit_cents=50,       # $0.50 per resolved ticket
)

Provider Support

| Provider | Wrapper | Streaming | Tool-Call Tracking |
| --- | --- | --- | --- |
| Anthropic | wrap_anthropic | Automatic | Counts tool_use content blocks |
| OpenAI | wrap_openai | Needs stream_options={"include_usage": True} | Counts tool_calls on choices |
| OpenAI-compatible (Groq, OpenRouter, Together, Nebius, vLLM, …) | wrap_openai with provider="openai_compatible" | Same as OpenAI | Same as OpenAI |
| AWS Bedrock | wrap_bedrock | Automatic | Counts toolUse content blocks |
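The tool-call column can be illustrated with plain dictionaries shaped like each provider's response. This is a minimal sketch, not Tollgate's code: the helper names are hypothetical, and real SDK responses are objects rather than dicts.

```python
from typing import Any, Dict


def count_anthropic_tool_calls(message: Dict[str, Any]) -> int:
    # Anthropic Messages API: tool calls arrive as content blocks with type "tool_use".
    return sum(1 for block in message.get("content", []) if block.get("type") == "tool_use")


def count_openai_tool_calls(completion: Dict[str, Any]) -> int:
    # OpenAI Chat Completions: each choice's message may carry a tool_calls list (or None).
    total = 0
    for choice in completion.get("choices", []):
        total += len(choice.get("message", {}).get("tool_calls") or [])
    return total


def count_bedrock_tool_calls(response: Dict[str, Any]) -> int:
    # Bedrock Converse: content blocks under output.message that contain a "toolUse" key.
    content = response.get("output", {}).get("message", {}).get("content", [])
    return sum(1 for block in content if "toolUse" in block)
```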

Configuration

| Environment Variable | Required | Default |
| --- | --- | --- |
| TOLLGATE_API_KEY | Yes | (none) |
| TOLLGATE_BASE_URL | No | https://tollgateai.vercel.app |

Or pass them directly:

tollgate = create_tollgate_client(
    api_key="tg_live_xxx",
    base_url="https://tollgateai.vercel.app",
    timeout=10.0,       # per-request timeout in seconds (default 10)
    max_retries=2,      # retries on 5xx/429/network (default 2)
)
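One way to picture how explicit arguments and environment variables interact is the resolution order below. It is a sketch under two assumptions: that explicit arguments take precedence over the environment, and that a missing key raises an error. resolve_tollgate_config is a hypothetical helper, not part of the tollgate package.

```python
import os
from typing import Dict, Optional, Union

DEFAULT_BASE_URL = "https://tollgateai.vercel.app"


def resolve_tollgate_config(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    timeout: float = 10.0,
    max_retries: int = 2,
) -> Dict[str, Union[str, float, int]]:
    # Assumption: an explicit argument wins over the matching environment variable.
    key = api_key or os.environ.get("TOLLGATE_API_KEY")
    if not key:
        raise ValueError("TOLLGATE_API_KEY is not set and no api_key was passed")
    # Fall back to the documented default base URL when neither source sets one.
    url = base_url or os.environ.get("TOLLGATE_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": key, "base_url": url.rstrip("/"), "timeout": timeout, "max_retries": max_retries}
```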

Auto-Instrumentation

Wrap your provider client once. Every create or converse call then reports usage in the background on a non-blocking daemon thread. Tracking failures go to on_error (default: logger.warning) and never break your LLM call.
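The fire-and-forget pattern described above can be sketched with the standard library alone. This illustrates the technique, not Tollgate's implementation: report_in_background and the send callable are hypothetical names. One property of daemon threads worth knowing: Python does not wait for them at interpreter exit, so in a pattern like this a report still in flight when a short-lived script ends can be lost.

```python
import logging
import threading
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("tollgate_sketch")


def report_in_background(
    send: Callable[[Dict[str, Any]], Any],
    event: Dict[str, Any],
    on_error: Optional[Callable[[BaseException], None]] = None,
) -> threading.Thread:
    """Run send(event) on a daemon thread; failures go to on_error, never to the caller."""

    def _worker() -> None:
        try:
            send(event)
        except Exception as exc:  # caught here so the caller's LLM call is unaffected
            if on_error is not None:
                on_error(exc)
            else:
                logger.warning("usage report failed: %s", exc)

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()  # returns immediately; the caller does not block on the report
    return thread
```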

Anthropic

from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this ticket…"}],
)

OpenAI

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
openai = wrap_openai(OpenAI(), tollgate, customer_id="cust_acme")

openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

OpenAI-Compatible Gateways

Point the OpenAI SDK at any compatible endpoint and pass provider="openai_compatible":

import os

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
groq = wrap_openai(
    OpenAI(
        api_key=os.environ["GROQ_API_KEY"],  # your Groq key, read from the environment
        base_url="https://api.groq.com/openai/v1",
    ),
    tollgate,
    customer_id="cust_acme",
    provider="openai_compatible",
)

groq.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)

AWS Bedrock

import boto3
from tollgate import create_tollgate_client, wrap_bedrock

tollgate = create_tollgate_client()
bedrock = wrap_bedrock(
    boto3.client("bedrock-runtime", region_name="us-east-1"),
    tollgate,
    customer_id="cust_acme",
)

bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)

Streaming

Streaming is captured automatically — iterate the stream as usual and usage is reported when the stream ends.

OpenAI and OpenAI-compatible streams require stream_options={"include_usage": True}; without it, no final usage chunk is sent. Anthropic and Bedrock need no extra flags.

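# openai is the wrapped client from the OpenAI example above.
stream = openai.chat.completions.create(
    model="gpt-4o",
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Hello"}],
)
for chunk in stream:
    if chunk.choices:  # the final usage-only chunk has an empty choices list
        print(chunk.choices[0].delta.content or "", end="")  # render to UI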
# Usage reported automatically when stream ends.
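Reporting "when the stream ends" can be done by wrapping the iterator: pass every chunk through untouched, remember the last usage seen, and report once the iterator is exhausted. A minimal sketch, assuming chunks expose a usage attribute that is None until the final chunk (as OpenAI does when include_usage is set); meter_stream is a hypothetical helper, and SimpleNamespace objects stand in for real SDK chunks.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, Iterator


def meter_stream(chunks: Iterable[Any], on_usage: Callable[[Any], None]) -> Iterator[Any]:
    """Yield chunks unchanged; call on_usage(usage) once the stream is exhausted."""
    last_usage = None
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            last_usage = usage  # OpenAI sends usage only on the final, choice-less chunk
        yield chunk
    if last_usage is not None:
        on_usage(last_usage)


# Stand-in stream: two content chunks, then a usage-only chunk.
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hel"))], usage=None),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="lo"))], usage=None),
    SimpleNamespace(choices=[], usage=SimpleNamespace(prompt_tokens=9, completion_tokens=2)),
]
reported = []
text = "".join(
    c.choices[0].delta.content for c in meter_stream(fake_stream, reported.append) if c.choices
)
```

In this sketch, a caller that breaks out of the loop early never exhausts the generator, so nothing is reported; how Tollgate handles abandoned streams is not documented here.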

What Gets Tracked

Every auto-instrumented call captures the following from the provider response:

| Field | Source | Description |
| --- | --- | --- |
| tokensIn | usage.input_tokens / prompt_tokens | Input tokens consumed |
| tokensOut | usage.output_tokens / completion_tokens | Output tokens generated |
| reasoningTokens | completion_tokens_details.reasoning_tokens | Reasoning/chain-of-thought tokens (OpenAI) |
| cachedTokens | cache_read_input_tokens / cached_tokens | Prompt cache read tokens |
| cacheWrite5mTokens | cache_creation_input_tokens | 5-min TTL cache write tokens |
| cacheWrite1hTokens | cache_creation.ephemeral_1h_input_tokens | 1-hour TTL cache write tokens |
| toolCalls | Content block / choice inspection | Number of tool calls in the response |
| provider | Wrapper default or override | anthropic, openai, openai_compatible, bedrock |
| model | Response object | Model identifier as reported by the provider |
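The Source column can be read as a field mapping. The sketch below normalizes a raw usage dict in either the Anthropic or the OpenAI shape into the tracked field names. It is an illustration only: normalize_usage is hypothetical, it works on plain dicts rather than SDK objects, and it mirrors the table as written, so the exact mapping Tollgate's event builders apply may differ.

```python
from typing import Any, Dict


def normalize_usage(usage: Dict[str, Any]) -> Dict[str, int]:
    """Map an Anthropic- or OpenAI-style usage dict onto the tracked field names."""
    if "input_tokens" in usage:  # Anthropic Messages API shape
        cache_creation = usage.get("cache_creation") or {}
        return {
            "tokensIn": usage.get("input_tokens", 0),
            "tokensOut": usage.get("output_tokens", 0),
            "reasoningTokens": 0,  # the table lists reasoning tokens for OpenAI only
            "cachedTokens": usage.get("cache_read_input_tokens") or 0,
            "cacheWrite5mTokens": usage.get("cache_creation_input_tokens") or 0,
            "cacheWrite1hTokens": cache_creation.get("ephemeral_1h_input_tokens") or 0,
        }
    # OpenAI Chat Completions shape
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}
    return {
        "tokensIn": usage.get("prompt_tokens", 0),
        "tokensOut": usage.get("completion_tokens", 0),
        "reasoningTokens": completion_details.get("reasoning_tokens") or 0,
        "cachedTokens": prompt_details.get("cached_tokens") or 0,
        "cacheWrite5mTokens": 0,
        "cacheWrite1hTokens": 0,
    }
```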

Cost is computed server-side from token counts and a rate card that auto-syncs daily from the public LiteLLM registry. Unknown models are priced at $0 and flagged in logs.
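As a rough picture of pricing from token counts and a rate card, the sketch below prices a call from per-million-token rates and returns $0 with a logged warning for unknown models, matching the behavior described above. The model name and rates are hypothetical placeholders, not real prices, and the sketch ignores cached, reasoning, and cache-write tokens, which the real server-side formula may price separately.

```python
import logging
from typing import Dict, Tuple

logger = logging.getLogger("rate_card_sketch")

# Hypothetical rates in USD per million tokens: (input, output). Not real prices.
RATE_CARD: Dict[str, Tuple[float, float]] = {
    "example-model": (3.00, 15.00),
}


def cost_cents(model: str, tokens_in: int, tokens_out: int) -> float:
    """Price one call in cents; unknown models cost 0 and are logged."""
    rates = RATE_CARD.get(model)
    if rates is None:
        logger.warning("no rate card entry for model %r; pricing at $0", model)
        return 0.0
    input_rate, output_rate = rates
    dollars = (tokens_in * input_rate + tokens_out * output_rate) / 1_000_000
    return dollars * 100
```

With these placeholder rates, 1,200 input and 450 output tokens cost (1,200 × 3 + 450 × 15) / 1,000,000 = $0.01035, or about 1.035 cents.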


Outcome-Based Pricing

Under per-resolution pricing, only a resolved run earns revenue. An escalated or failed run earns $0 but its provider cost still counts. The pattern:

  1. Wrap to meter cost on every LLM call (automatic).
  2. Resolve once at the end to book the outcome.

run_id = "ticket_8842"
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id=run_id,
)

# … multiple LLM calls within this run …

tollgate.resolve(
    run_id=run_id,
    customer_id="cust_acme",
    outcome="resolved",        # "resolved" | "escalated" | "failed"
    revenue_unit_cents=50,
)

For simple per-call billing, pass revenue_unit_cents in the wrap options and skip resolve().
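To make the per-resolution arithmetic concrete: gross margin is revenue minus provider cost, and margin percentage divides that by revenue. The sketch below applies this to a few runs priced at 50 cents per resolution; the run costs are invented illustration values, and run_margin is a hypothetical helper, not how Tollgate's rollups are computed.

```python
from typing import List, Optional, Tuple


def run_margin(outcome: str, revenue_unit_cents: int, cost_cents: float) -> Tuple[float, Optional[float]]:
    """Return (margin in cents, margin as a fraction of revenue) for one run."""
    # Per-resolution pricing: only "resolved" earns revenue; cost always counts.
    revenue = revenue_unit_cents if outcome == "resolved" else 0
    margin = revenue - cost_cents
    pct = margin / revenue if revenue else None  # undefined when there is no revenue
    return margin, pct


# Illustrative runs: (outcome, provider cost in cents). Values are invented.
runs: List[Tuple[str, float]] = [("resolved", 12.0), ("escalated", 20.0), ("resolved", 8.0)]
total_revenue = sum(50 for outcome, _ in runs if outcome == "resolved")
total_cost = sum(cost for _, cost in runs)
total_margin = total_revenue - total_cost
```

Across these three runs, $1.00 of revenue against $0.40 of cost gives a 60% gross margin, while the escalated run on its own loses $0.20.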


Customer & Plan Setup

Create customers and assign plans before sending usage so that plan-priced revenue is recognized from the first event. upsert_customer is idempotent, so it is safe to run on every boot.

tollgate.upsert_customer(
    "cust_acme",
    name="Acme Corp",
    company="Acme Corp",
    seats=5,
    plan={
        "name": "Pro Plan",
        "pricingModel": "usage_based",   # per_unit | per_resolution | usage_based | per_seat | flat | hybrid
        "unitRevenueCents": 10,
    },
)

Manual Tracking

For full control, unusual providers, or non-LLM cost events:

tollgate.track({
    "customerId": "cust_acme",
    "runId": "run_12345",
    "provider": "anthropic",
    "model": "claude-sonnet-4-6",
    "tokensIn": 1200,
    "tokensOut": 450,
    "reasoningTokens": 0,
    "cachedTokens": 0,
    "toolCalls": 2,
    "revenueUnitCents": 50,
    "idempotencyKey": "run_12345#step_1",
})
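When several manual events belong to one run, deriving idempotencyKey from the run ID and a step number (as in the run_12345#step_1 example above) keeps retried events from being counted twice. A minimal sketch: manual_event is a hypothetical helper that only builds the dict you would pass to tollgate.track(), using the field names shown above.

```python
from typing import Any, Dict


def manual_event(
    customer_id: str,
    run_id: str,
    step: int,
    provider: str,
    model: str,
    tokens_in: int,
    tokens_out: int,
    **extra: Any,
) -> Dict[str, Any]:
    """Build a track() payload whose idempotency key is stable across retries."""
    event: Dict[str, Any] = {
        "customerId": customer_id,
        "runId": run_id,
        "provider": provider,
        "model": model,
        "tokensIn": tokens_in,
        "tokensOut": tokens_out,
        "idempotencyKey": f"{run_id}#step_{step}",
    }
    event.update(extra)  # e.g. toolCalls=2, revenueUnitCents=50
    return event
```

Usage: tollgate.track(manual_event("cust_acme", "run_12345", 1, "anthropic", "claude-sonnet-4-6", 1200, 450, toolCalls=2)).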

Already have an exact cost?

Pass provider_cost_cents (a number, or a callable that receives the provider response) and the server uses that value verbatim instead of the rate card:

anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    provider_cost_cents=3.5,   # or: lambda response: compute_my_own_cost(response)
)
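As an illustration of the callable form, the hypothetical compute_my_own_cost below reads usage.input_tokens and usage.output_tokens (attribute names on Anthropic Message objects) and applies your own negotiated rates. The rates are placeholders, and the return value is assumed to be in cents, matching the provider_cost_cents name.

```python
from types import SimpleNamespace
from typing import Any

# Hypothetical negotiated rates, in cents per million tokens. Replace with your own.
INPUT_CENTS_PER_MTOK = 250.0
OUTPUT_CENTS_PER_MTOK = 1200.0


def compute_my_own_cost(response: Any) -> float:
    """Return the cost of one Anthropic-style response in cents."""
    usage = response.usage
    return (
        usage.input_tokens * INPUT_CENTS_PER_MTOK
        + usage.output_tokens * OUTPUT_CENTS_PER_MTOK
    ) / 1_000_000


# Stand-in for an SDK response, for trying the function without an API call.
fake_response = SimpleNamespace(usage=SimpleNamespace(input_tokens=2000, output_tokens=500))
```

It would then be passed as provider_cost_cents=compute_my_own_cost in the wrap_anthropic call.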

API Reference

Exports

# Client
create_tollgate_client(api_key?, base_url?, timeout?, max_retries?)  # → TollgateClient
TollgateError                    # Exception with status & body

# Auto-instrumentation wrappers
wrap_anthropic(client, tollgate, customer_id, **kwargs)   # → instrumented Anthropic client
wrap_openai(client, tollgate, customer_id, **kwargs)      # → instrumented OpenAI / compatible client
wrap_bedrock(client, tollgate, customer_id, **kwargs)     # → instrumented Bedrock client

# Low-level event builders (for manual track payloads)
anthropic_event_from(msg, customer_id, **kwargs)          # → dict | None
openai_event_from(completion, customer_id, **kwargs)      # → dict | None
bedrock_event_from(usage, model, customer_id, **kwargs)   # → dict | None

TollgateClient

| Method | Description |
| --- | --- |
| track(event) | Report a single usage event. Idempotent on idempotencyKey. |
| resolve(run_id, customer_id, outcome, ...) | Close a run with an outcome. Books revenue only when outcome is "resolved". |
| upsert_customer(customer_id, ...) | Create or update a customer and optionally assign a plan. |

Wrapper Options

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| customer_id | str | Yes | Your end customer's stable identifier. |
| agent_id | str | No | Agent or workflow identifier. |
| run_id | str or Callable | No | Logical run ID. Defaults to the provider response ID. |
| provider | str | No | Override the reported provider (e.g. "openai_compatible"). |
| revenue_unit_cents | int or Callable | No | Revenue per call in cents. |
| provider_cost_cents | float or Callable | No | Exact cost override; skips the rate card. |
| on_error | Callable | No | Error handler for background tracking (default: logger.warning). |

How It Works

  1. Proxy wrappers intercept messages.create / chat.completions.create / converse without modifying the request or response.
  2. After the provider responds, the wrapper extracts token counts, tool call counts, and metadata from the response's usage object and content blocks.
  3. A POST /api/track is fired on a background daemon thread — non-blocking, with automatic retries on transient failures.
  4. The server computes cost from tokens via rate cards, joins it with your plan-configured revenue, and updates real-time margin rollups.
  5. Events are idempotent on idempotencyKey (auto-set to the provider response ID), so retries and stream replays never double-count.
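Steps 3 and 5 work together: a client may retry a POST that actually reached the server, and the idempotency key lets the server ignore the duplicate. The sketch below simulates that with an in-memory store and a transport whose first response is lost. Every name in it is hypothetical; a real server would persist keys, and this sketch retries immediately on connection errors only, whereas the client described above also retries on 5xx and 429 responses.

```python
from typing import Any, Callable, Dict


class DedupStore:
    """Server-side stand-in: records each idempotencyKey at most once."""

    def __init__(self) -> None:
        self.events: Dict[str, Dict[str, Any]] = {}

    def ingest(self, event: Dict[str, Any]) -> bool:
        key = event["idempotencyKey"]
        if key in self.events:
            return False  # duplicate delivery: acknowledged but not counted again
        self.events[key] = event
        return True


def post_with_retries(send: Callable[[Dict[str, Any]], Any], event: Dict[str, Any], max_retries: int = 2) -> Any:
    """Call send(event), retrying up to max_retries times on connection errors."""
    for attempt in range(max_retries + 1):
        try:
            return send(event)
        except ConnectionError:
            if attempt == max_retries:
                raise


store = DedupStore()
calls = {"n": 0}


def flaky_send(event: Dict[str, Any]) -> bool:
    # The first attempt is stored server-side, but its response is "lost" on the way back.
    calls["n"] += 1
    accepted = store.ingest(event)
    if calls["n"] == 1:
        raise ConnectionError("response lost")
    return accepted


result = post_with_retries(flaky_send, {"idempotencyKey": "msg_01", "tokensIn": 10})
```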

Privacy & Security

  • No prompt content is ever sent. Only token counts, model name, and metadata.
  • Events are deduplicated server-side — safe to retry.
  • Background tracking never raises into your application code.

License

Licensed for use with Tollgate.



Download files


Source Distribution

tollgateai-0.3.0.tar.gz (10.2 kB)

Uploaded Source

Built Distribution


tollgateai-0.3.0-py3-none-any.whl (11.6 kB)

Uploaded Python 3

File details

Details for the file tollgateai-0.3.0.tar.gz.

File metadata

  • Download URL: tollgateai-0.3.0.tar.gz
  • Size: 10.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for tollgateai-0.3.0.tar.gz
Algorithm Hash digest
SHA256 927fc6730c3b30e67415d1f3b524782944c635c020d15153069542d1f75398b2
MD5 9f658741ed7e88bd4babe295a4b1ac89
BLAKE2b-256 e1688f64dab03b5ed2f9aef429e99470a0d4f2f84ba6cdfd260ff1502831586e


File details

Details for the file tollgateai-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: tollgateai-0.3.0-py3-none-any.whl
  • Size: 11.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for tollgateai-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d943bd8253aa0fa30f58c2096460c1722fadb1d25dc00c757a4e6ff5a491f01c
MD5 76337ab61a11408c6ad8046271262109
BLAKE2b-256 09448a18827377d4ed5a22f0da58a83e2c88450e202ced62092cf4c5c21b9ab9

