
cProfile for LLMs: find which function is burning your AI budget. Flame graph output, zero-config, no proxy.

Project description

tokenspy 🔥

The local-first LLM observability stack. No cloud. No signup. No proxy.

Cost profiling · Structured tracing · Evaluations · Prompt versioning · Live dashboard

PyPI version · Tests · Python 3.10+ · License: MIT · Zero dependencies

pip install tokenspy

The Problem

You get an OpenAI invoice. It says $800 this month. You have no idea which function caused it.

def run_pipeline(query):
    docs = fetch_and_summarize(query)    # ← costs $600?
    entities = extract_entities(docs)    # ← or this one?
    return generate_report(entities)     # ← or this one?

Langfuse and Braintrust force you to reroute traffic through their cloud proxy. Sign up. Configure API keys. Break your local setup. Pay monthly.

tokenspy is your local alternative. One line. Runs entirely on your machine. Forever free.


What's New in v0.2.0: Full Observability Stack

tokenspy now covers everything Langfuse and Braintrust do, without sending a single byte to the cloud.

Feature                            v0.1  v0.2.0
Cost flame graph                   ✅     ✅
Budget alerts                      ✅     ✅
SQLite persistence                 ✅     ✅
Structured tracing (Trace + Span)  ❌     ✅
OpenTelemetry export               ❌     ✅
Evaluations + datasets             ❌     ✅
Prompt versioning                  ❌     ✅
Live web dashboard                 ❌     ✅

Feature 1: Cost Profiling (the original)

import tokenspy

@tokenspy.profile
def run_pipeline(query):
    docs = fetch_and_summarize(query)
    entities = extract_entities(docs)
    return generate_report(entities)

run_pipeline("Analyze Q3 earnings")
tokenspy.report()

Terminal output:

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘  tokenspy cost report                                                โ•‘
โ•‘  total: $0.0523  ยท  18,734 tokens  ยท  3 calls                       โ•‘
โ• โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
โ•‘                                                                      โ•‘
โ•‘  fetch_and_summarize      $0.038  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘  73%             โ•‘
โ•‘    โ””โ”€ gpt-4o               $0.038  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘  73%            โ•‘
โ•‘       โ””โ”€ 12,000 tokens                                               โ•‘
โ•‘                                                                      โ•‘
โ•‘  generate_report          $0.011  โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘  21%            โ•‘
โ•‘    โ””โ”€ gpt-4o               $0.011  โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘  21%            โ•‘
โ•‘       โ””โ”€ 3,600 tokens                                                โ•‘
โ•‘                                                                      โ•‘
โ•‘  extract_entities         $0.003  โ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘   6%            โ•‘
โ•‘    โ””โ”€ gpt-4o-mini          $0.003  โ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘   6%            โ•‘
โ•‘       โ””โ”€ 3,134 tokens                                                โ•‘
โ•‘                                                                      โ•‘
โ• โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
โ•‘  Optimization hints                                                  โ•‘
โ•‘                                                                      โ•‘
โ•‘  ๐Ÿ”ด fetch_and_summarize [gpt-4o]                                     โ•‘
โ•‘     Switch to gpt-4o-mini โ€” 94% cheaper  (~$540/month savings)      โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Now you know: fetch_and_summarize is burning 73% of your budget. Fix that one function.
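
The hint is directly actionable. A sketch of the fix, assuming fetch_and_summarize wraps a single chat call (your real function will differ):

import openai

def fetch_and_summarize(query: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",   # was "gpt-4o": ~94% cheaper per the hint above
        messages=[{"role": "user", "content": f"Summarize sources for: {query}"}],
    )
    return response.choices[0].message.content

Re-run the pipeline and tokenspy.report() confirms whether the 73% share actually dropped.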


Feature 2: Structured Tracing

See exactly what happens inside every LLM call (inputs, outputs, tokens, latency), organized into a tree of spans, just like Langfuse.

How it works

Your Code
    │
    ├── tokenspy.trace("research_pipeline")          ← top-level trace
    │       │
    │       ├── tokenspy.span("retrieve_docs")        ← child span
    │       │       └── vector_store.search(...)
    │       │
    │       ├── tokenspy.span("summarize", "llm")     ← LLM span
    │       │       └── client.chat.completions.create(...)
    │       │               │
    │       │               └── tokenspy interceptor auto-links:
    │       │                     model, input_tokens, output_tokens,
    │       │                     cost_usd, duration_ms → span record
    │       │
    │       └── tokenspy.span("rank_results")
    │
    └── t.score("relevance", 0.92)                   ← attach score

LLM calls made inside a span are linked automatically; no manual wiring.

Code example

import tokenspy

tokenspy.init(persist=True)   # save traces to ~/.tokenspy/usage.db

with tokenspy.trace("research_pipeline", input={"query": "climate change"}) as t:

    with tokenspy.span("retrieve_docs", span_type="retrieval") as s:
        docs = vector_store.search("climate change", top_k=5)
        s.update(output={"n_docs": len(docs), "sources": [d.title for d in docs]})

    with tokenspy.span("summarize", span_type="llm") as s:
        # Any LLM call here is AUTOMATICALLY attributed to this span
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize: {docs}"}]
        )
        s.update(output=response.choices[0].message.content)

    with tokenspy.span("rank_results", span_type="function") as s:
        ranked = rerank(docs, response)
        s.update(output=ranked[:3])

    t.update(output=ranked[:3])

# Attach quality scores after the fact
t.score("relevance", 0.92, scorer="human")
t.score("hallucination", 0.05, scorer="llm_judge", comment="Grounded in sources")

What gets recorded per span

Span: summarize
  ├── span_type:     llm
  ├── start_time:    2026-03-10 14:23:01.412
  ├── duration_ms:   842
  ├── model:         gpt-4o               ← auto-linked from LLM call
  ├── input_tokens:  4,200                ← auto-linked
  ├── output_tokens: 380                  ← auto-linked
  ├── cost_usd:      $0.0144              ← auto-linked
  ├── status:        ok
  └── output:        "Climate change refers to..."

Why tracing matters

Without tracing:

cost report: run_pipeline → $0.052 total, 3 calls

You know the total. You don't know which step took 800ms. You don't know what the retrieval returned. You can't replay it.

With tracing:

trace: research_pipeline  842ms  $0.052
  ├── retrieve_docs       12ms   $0.000  → returned 5 docs
  ├── summarize           810ms  $0.0144 → gpt-4o · 4,200 in · 380 out
  └── rank_results        8ms    $0.000  → [doc3, doc1, doc5]
  scores: relevance=0.92  hallucination=0.05

You see the full picture. You know retrieval was fast but the LLM was slow. You have the inputs and outputs for debugging. You can score the quality.

Nested spans and async

Works with nested spans and async code, no changes needed:

# Async works identically
async def run():
    async with tokenspy.trace("async_pipeline") as t:
        async with tokenspy.span("step1") as s:
            result = await async_llm_call()
            s.update(output=result)
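
Nesting is just spans opened inside spans; the tree records the hierarchy. A minimal sketch (llm_call is a placeholder for your own helper):

with tokenspy.trace("nested_pipeline") as t:
    with tokenspy.span("outer_step") as outer:
        with tokenspy.span("inner_llm", span_type="llm") as inner:
            # an LLM call here is attributed to inner_llm,
            # which is recorded as a child of outer_step
            answer = llm_call("...")
            inner.update(output=answer)
        outer.update(output={"done": True})
    t.update(output=answer)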

Feature 3: OpenTelemetry Export

Send tokenspy data to Grafana, Jaeger, Datadog, Honeycomb, or any other OTEL-compatible backend:

pip install tokenspy[otel]

tokenspy.init(
    persist=True,
    otel_endpoint="http://localhost:4317",   # your OTLP gRPC endpoint
    otel_service_name="my-llm-app",
)

Every LLM call is exported as an OpenTelemetry span with standard attributes:

llm.openai.chat
  llm.request.model:           "gpt-4o"
  llm.usage.prompt_tokens:     4200
  llm.usage.completion_tokens: 380
  llm.usage.cost_usd:          0.0144
  code.function:               "summarize"

What this unlocks:

  • Grafana dashboard: cost per minute, P95 latency, error rate
  • Jaeger: distributed trace view across microservices
  • Datadog: alert when cost per request exceeds threshold
  • Any existing OTEL pipeline: tokenspy plugs straight in
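
A convenient pattern (our convention, not a tokenspy feature): read the standard OTLP environment variable and enable export only when it is set, so purely local runs skip the exporter. This assumes passing otel_endpoint=None disables export; check the docs if your version differs:

import os
import tokenspy

endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")  # standard OTEL variable
tokenspy.init(
    persist=True,
    otel_endpoint=endpoint,          # assumption: None means "don't export"
    otel_service_name="my-llm-app",
)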

Feature 4: Evaluations + Datasets

Run your LLM functions against golden test sets and track quality over time, like Braintrust but local.

import tokenspy
from tokenspy.eval import scorers

tokenspy.init(persist=True)

# 1. Build a dataset
ds = tokenspy.dataset("qa-golden")
ds.add(input={"question": "Capital of France?"}, expected_output="Paris")
ds.add(input={"question": "Capital of Germany?"}, expected_output="Berlin")
ds.from_json("more_test_cases.json")   # bulk import

# 2. Define the function under test
@tokenspy.profile
def answer_question(input: dict) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input["question"]}]
    )
    return response.choices[0].message.content.strip()

# 3. Run the experiment
exp = tokenspy.experiment(
    "gpt4o-mini-baseline",
    dataset="qa-golden",
    fn=answer_question,
    scorers=[scorers.exact_match, scorers.contains],
)
results = exp.run()
results.summary()

Terminal output:

tokenspy - Experiment: gpt4o-mini-baseline
Dataset: qa-golden  (2 items)
────────────────────────────────────────────────────────────────
  ✓  Capital of France?     exact_match=1.0  contains=1.0  $0.0001  112ms
  ✓  Capital of Germany?    exact_match=1.0  contains=1.0  $0.0001   98ms
────────────────────────────────────────────────────────────────
  Passed:  2/2  (100.0%)
  Cost:    $0.0002
  Avg ms:  105

LLM-as-judge scoring

from tokenspy.eval import scorers

# Scores 0.0–1.0 using a small model as judge
judge = scorers.llm_judge(
    criteria="Is the answer factually accurate and concise?",
    model="gpt-4o-mini",
)

exp = tokenspy.experiment(
    "accuracy-check",
    dataset="qa-golden",
    fn=answer_question,
    scorers=[scorers.exact_match, judge],
)
results = exp.run()
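
You can also mix in your own scorers. The sketch below assumes a scorer is a plain callable returning a 0.0–1.0 float, mirroring scorers.exact_match; the exact signature tokenspy expects may differ, so treat this as hypothetical:

def concise_answer(output: str, expected_output: str) -> float:
    # hypothetical scorer: pass if the answer adds at most one extra word
    return 1.0 if len(output.split()) <= len(expected_output.split()) + 1 else 0.0

exp = tokenspy.experiment(
    "conciseness-check",
    dataset="qa-golden",
    fn=answer_question,
    scorers=[scorers.exact_match, concise_answer],
)
results = exp.run()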

Compare experiments

# After a prompt change, compare against the baseline
results.compare("gpt4o-mini-baseline")

Experiment comparison: gpt4o-mini-v2  vs  gpt4o-mini-baseline
──────────────────────────────────────────────────
  exact_match:    0.95  →  0.80     ▼ 15%
  llm_judge:      0.88  →  0.91     ▲  3%
  cost:        $0.0002  →  $0.0003  ▲ 50%
  pass rate:      100%  →   80%     ▼ 20%
──────────────────────────────────────────────────

Feature 5: Prompt Versioning

Track every version of every prompt. Know exactly which prompt version caused a cost spike or quality drop.

import tokenspy

tokenspy.init(persist=True)

# Push a new version (auto-increments: 1, 2, 3...)
p = tokenspy.prompts.push(
    "summarizer",
    "Summarize the following text in {{style}} style, max {{max_words}} words:\n\n{{text}}"
)
print(p.version)   # 1

# Compile with variables
compiled = p.compile(
    style="concise",
    max_words=100,
    text="Long document about climate change..."
)
# → "Summarize the following text in concise style, max 100 words:\n\nLong document..."

# Pull specific version or latest
p_latest = tokenspy.prompts.pull("summarizer")
p_v1     = tokenspy.prompts.pull("summarizer", version=1)
p_prod   = tokenspy.prompts.pull("summarizer", label="production")

# Mark a version as production
tokenspy.prompts.set_production("summarizer", version=2)

# List all prompts
tokenspy.prompts.list()
# [{"name": "summarizer", "version": 1, ...},
#  {"name": "summarizer", "version": 2, "is_production": True}, ...]

Why this matters: When you run an experiment, you know exactly which prompt version was active. When costs spike, you can diff v1 vs v2 and see what changed.
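
Prompt versioning also composes with evaluations: pin the production version inside the function under test, so every experiment run is attributable to a known prompt. A sketch built only from the calls above (the model choice is illustrative):

import openai
import tokenspy

summarizer = tokenspy.prompts.pull("summarizer", label="production")

def summarize(text: str) -> str:
    prompt = summarizer.compile(style="concise", max_words=100, text=text)
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content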


Feature 6: Live Web Dashboard

pip install tokenspy[server]

tokenspy serve
# → http://localhost:7234 (opens automatically)

tokenspy serve --port 8080 --db /path/to/custom.db

The dashboard has 5 tabs:

Overview: cost/day bar chart, top functions by cost, model breakdown donut, live call counter (WebSocket push)

┌──────────────────────────────────────────────────────────────┐
│  tokenspy dashboard                    live  ● 3 calls/min   │
├──────────────┬───────────────────────────────────────────────┤
│  Overview    │  Cost per day (last 7 days)                   │
│  Traces      │  ████ ██ ███ █████ ██ ████ ███                │
│  Evals       │                                               │
│  Prompts     │  Top functions        Cost      % of total    │
│  Settings    │  fetch_and_summarize  $0.038    73%  ████████ │
│              │  generate_report      $0.011    21%  ████     │
│              │  extract_entities     $0.003     6%  █        │
└──────────────┴───────────────────────────────────────────────┘

Traces: browse every trace; click to expand the full span tree with inputs/outputs/scores

┌─────────────────────────────────────────────────────────────┐
│  Traces                               Filter: all           │
├─────────────────────────────────────────────────────────────┤
│  ▼ research_pipeline   842ms  $0.052  2026-03-10 14:23      │
│    ├─ retrieve_docs    12ms   $0.000  retrieval             │
│    ├─ summarize        810ms  $0.0144 llm · gpt-4o          │
│    └─ rank_results     8ms    $0.000  function              │
│    scores: relevance=0.92  hallucination=0.05               │
│                                                             │
│  ▶ data_extraction     340ms  $0.021  2026-03-10 14:19      │
│  ▶ report_generation   190ms  $0.009  2026-03-10 14:15      │
└─────────────────────────────────────────────────────────────┘

Evaluations: run history, pass rates, score distributions per experiment

Prompts: version history table, production flag, content preview

Settings: DB path, OTEL status, version info


Quick Start

Minimal (1 line)

import tokenspy

@tokenspy.profile
def my_function():
    return openai.chat.completions.create(model="gpt-4o", messages=[...])

my_function()
tokenspy.report()

Full v0.2.0 setup

import tokenspy

tokenspy.init(
    persist=True,        # save everything to ~/.tokenspy/usage.db
    track_git=True,      # tag calls with git SHA
    otel_endpoint="http://localhost:4317",  # optional: export to Grafana/Jaeger
)

with tokenspy.trace("my_pipeline", input={"query": q}) as t:
    with tokenspy.span("retrieve") as s:
        docs = fetch(q)
        s.update(output=docs)
    with tokenspy.span("generate", span_type="llm") as s:
        answer = llm_call(docs)      # auto-linked to span
    t.update(output=answer)

t.score("quality", 0.9)

tokenspy.report()
# then, from a shell: tokenspy serve  → opens the dashboard at http://localhost:7234

CLI

# Call history
tokenspy history
tokenspy history --limit 50

# Reports
tokenspy report
tokenspy report --format html

# Cost diff
tokenspy compare --db before.db --db after.db
tokenspy compare --commit abc123 --commit def456

# GitHub Actions annotations
tokenspy annotate --current current.db --baseline baseline.db

# Live dashboard
tokenspy serve
tokenspy serve --port 8080 --no-open

Budget Alerts

@tokenspy.profile(budget_usd=0.10)
def my_agent(query): ...
# UserWarning: [tokenspy] Budget exceeded in my_agent: $0.1423 > $0.1000

@tokenspy.profile(budget_usd=0.10, on_exceeded="raise")
def strict_agent(query): ...
# raises BudgetExceededError (inherits from BaseException, so except-Exception guards inside SDK code can't swallow it)
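
Since the error derives from BaseException, a blanket except Exception at the call site will not catch it either; catch it explicitly. A short sketch (assuming the exception class is exported at the package top level):

import tokenspy

@tokenspy.profile(budget_usd=0.10, on_exceeded="raise")
def strict_agent(query): ...

try:
    strict_agent("summarize every filing since 2019")
except tokenspy.BudgetExceededError as err:   # assumed export location
    print(f"[budget] aborted: {err}")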

LangChain / LangGraph

from tokenspy.integrations.langchain import TokenspyCallbackHandler

chain.invoke(prompt, config={"callbacks": [TokenspyCallbackHandler()]})

# Works with LangGraph agents (same callback system)

GitHub Actions: Cost Diff Per PR

from tokenspy.ci import annotate_cost_diff
annotate_cost_diff("current_run.db", "baseline.db")

::warning::fetch_and_summarize cost increased $0.031 (62.4%)

Function              Cost     vs Baseline
fetch_and_summarize   $0.0812  ▲ 62.4%
extract_entities      $0.0031  ▼ 2.1%

How It Works

tokenspy monkey-patches the SDK client in-process, the same wrap-and-delegate technique used by libraries like unittest.mock:

Your Code
    │
    ├── tokenspy.trace("pipeline") ──────────────── opens trace context
    │       │
    │       └── tokenspy.span("step") ───────────── opens span context
    │               │
    │               └── openai.chat.completions.create(...)
    │                           │
    │                           └── tokenspy interceptor (monkey-patch)
    │                                   ├── calls original SDK method
    │                                   ├── reads response.usage
    │                                   ├── looks up cost in pricing table
    │                                   ├── records CallRecord in Tracker
    │                                   ├── auto-links to active span  ← NEW
    │                                   └── returns response unchanged

tokenspy.report()   → flame graph
tokenspy serve      → web dashboard

No proxy server. No HTTP interception. No environment variables. Your code runs exactly as before.
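
For intuition, here is the general shape of the technique as a standalone illustration; this is not tokenspy's actual source, just a minimal wrap-and-delegate over the OpenAI SDK (import path valid for openai>=1.0):

import functools
from openai.resources.chat.completions import Completions

_original_create = Completions.create            # keep the real method

@functools.wraps(_original_create)
def _instrumented_create(self, *args, **kwargs):
    response = _original_create(self, *args, **kwargs)   # call through
    usage = getattr(response, "usage", None)
    if usage is not None:
        # a real tracker would map (model, tokens) to cost here
        print(kwargs.get("model"), usage.prompt_tokens, usage.completion_tokens)
    return response                                      # callers see it unchanged

Completions.create = _instrumented_create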


vs. Langfuse and Braintrust

                          Langfuse    Braintrust   tokenspy
Requires proxy / cloud    ✅ cloud    ✅ cloud     ❌ fully local
Requires signup           ✅ yes      ✅ yes       ❌ no
Data leaves your machine  ✅ yes      ✅ yes       ❌ never
Works offline             ❌ no       ❌ no        ✅ yes
Zero dependencies (core)  ❌ no       ❌ no        ✅ yes
Structured tracing        ✅ yes      ✅ yes       ✅ yes
Evaluations + datasets    ✅ yes      ✅ yes       ✅ yes
LLM-as-judge scoring      ✅ yes      ✅ yes       ✅ yes
Prompt versioning         ✅ yes      ✅ yes       ✅ yes
OpenTelemetry export      ⚡ partial  ❌ no        ✅ yes
Flame graph by function   ❌ no       ❌ no        ✅ yes
@decorator API            ❌ no       ❌ no        ✅ yes
Budget alerts             ⚡ partial  ⚡ partial   ✅ yes
Git commit cost tracking  ❌ no       ❌ no        ✅ yes
GitHub Actions cost diff  ❌ no       ❌ no        ✅ yes
Optimization hints        ❌ no       ❌ no        ✅ yes
Monthly cost              $0–$250    $0–$300      free forever

tokenspy's unique advantages:

  • No proxy: intercepts in-process, zero latency overhead, works with any network config
  • @decorator API: profile any function with one line, no SDK changes
  • Flame graph: visual cost breakdown by function, not just by model
  • Git tracking: tag every call with commit SHA, compare costs across code versions
  • PR cost diffs: catch cost regressions in CI before they ship

Supported Providers

Provider   Package                    Auto-detected
OpenAI     openai>=1.0                chat.completions.create (sync + async + streaming)
Anthropic  anthropic>=0.30            messages.create (sync + async + streaming)
Google     google-generativeai>=0.7   generate_content
LangChain  langchain-core>=0.2        Callback handler (any model/provider)
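
Auto-detection means the same decorator works across providers with no extra setup. A sketch with the Anthropic SDK (the model name comes from the pricing table below):

import anthropic
import tokenspy

@tokenspy.profile
def ask_claude(question: str) -> str:
    client = anthropic.Anthropic()       # reads ANTHROPIC_API_KEY
    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": question}],
    )
    return message.content[0].text       # tokens + cost recorded automatically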

Install Options

pip install tokenspy              # zero dependencies: core profiling only
pip install tokenspy[openai]      # + openai SDK
pip install tokenspy[anthropic]   # + anthropic SDK
pip install tokenspy[langchain]   # + langchain-core
pip install tokenspy[otel]        # + OpenTelemetry export
pip install tokenspy[server]      # + web dashboard (fastapi + uvicorn)
pip install tokenspy[all]         # openai + anthropic + langchain

Built-in Pricing Table

30+ models, updated March 2026. No network calls.

Model              Input $/1M   Output $/1M
claude-opus-4-6    $15.00       $75.00
claude-sonnet-4-6  $3.00        $15.00
claude-haiku-4-5   $0.80        $4.00
gpt-4o             $2.50        $10.00
gpt-4o-mini        $0.15        $0.60
o1                 $15.00       $60.00
gemini-1.5-pro     $1.25        $5.00
gemini-1.5-flash   $0.075       $0.30

→ Full table
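
Prices are per million tokens, so a call's cost is a straight linear combination. Working the summarize call from the tracing example (4,200 input / 380 output tokens on gpt-4o):

input_cost  = 4_200 / 1_000_000 * 2.50    # $0.01050
output_cost =   380 / 1_000_000 * 10.00   # $0.00380
total_cost  = input_cost + output_cost    # ≈ $0.0143, matching the report up to rounding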


Contributing

git clone https://github.com/pinakimishra95/tokenspy
cd tokenspy
pip install -e ".[dev]"
pytest tests/    # 139 tests, ~0.3s

Issues and PRs welcome, especially for new provider support and updated pricing.


License

MIT © Pinaki Mishra. See LICENSE.


Everything Langfuse and Braintrust do. Zero cloud. Zero signup. Zero cost.

GitHub · PyPI · Docs · Issues
