cProfile for LLMs: find which function is burning your AI budget. Flame graph output, zero-config, no proxy.
Project description
tokenspy 🔥
The local-first LLM observability stack. No cloud. No signup. No proxy.
Cost profiling · Structured tracing · Evaluations · Prompt versioning · Live dashboard
pip install tokenspy
The Problem
You get an OpenAI invoice. It says $800 this month. You have no idea which function caused it.
def run_pipeline(query):
    docs = fetch_and_summarize(query)       # ← costs $600?
    entities = extract_entities(docs)       # ← or this one?
    return generate_report(entities)        # ← or this one?
Langfuse and Braintrust force you to reroute traffic through their cloud proxy. Sign up. Configure API keys. Break your local setup. Pay monthly.
tokenspy is your local alternative. One line. Runs entirely on your machine. Forever free.
What's New in v0.2.0: Full Observability Stack
tokenspy now covers everything Langfuse and Braintrust do, without sending a single byte to the cloud.
| Feature | v0.1 | v0.2.0 |
|---|---|---|
| Cost flame graph | ✅ | ✅ |
| Budget alerts | ✅ | ✅ |
| SQLite persistence | ✅ | ✅ |
| Structured tracing (Trace + Span) | ❌ | ✅ |
| OpenTelemetry export | ❌ | ✅ |
| Evaluations + datasets | ❌ | ✅ |
| Prompt versioning | ❌ | ✅ |
| Live web dashboard | ❌ | ✅ |
Feature 1: Cost Profiling (the original)
import tokenspy

@tokenspy.profile
def run_pipeline(query):
    docs = fetch_and_summarize(query)
    entities = extract_entities(docs)
    return generate_report(entities)

run_pipeline("Analyze Q3 earnings")
tokenspy.report()
Terminal output:
────────────────────────────────────────────────────────────────────────
 tokenspy cost report
 total: $0.0523 · 18,734 tokens · 3 calls
────────────────────────────────────────────────────────────────────────

 fetch_and_summarize   $0.038   ████████████░░░░   73%
   ├─ gpt-4o           $0.038   ████████████░░░░   73%
   └─ 12,000 tokens

 generate_report       $0.011   ███░░░░░░░░░░░░░   21%
   ├─ gpt-4o           $0.011   ███░░░░░░░░░░░░░   21%
   └─ 3,600 tokens

 extract_entities      $0.003   █░░░░░░░░░░░░░░░    6%
   ├─ gpt-4o-mini      $0.003   █░░░░░░░░░░░░░░░    6%
   └─ 3,134 tokens

────────────────────────────────────────────────────────────────────────
 Optimization hints

 🔴 fetch_and_summarize [gpt-4o]
    Switch to gpt-4o-mini → 94% cheaper (~$540/month savings)
────────────────────────────────────────────────────────────────────────
Now you know: fetch_and_summarize is burning 73% of your budget. Fix that one function.
Feature 2: Structured Tracing
See exactly what happens inside every LLM call (inputs, outputs, tokens, latency), organized into a tree of spans, just like Langfuse.
How it works
Your Code
 │
 ├── tokenspy.trace("research_pipeline")      ← top-level trace
 │     │
 │     ├── tokenspy.span("retrieve_docs")     ← child span
 │     │     └── vector_store.search(...)
 │     │
 │     ├── tokenspy.span("summarize", "llm")  ← LLM span
 │     │     └── client.chat.completions.create(...)
 │     │           │
 │     │           └── tokenspy interceptor auto-links:
 │     │                 model, input_tokens, output_tokens,
 │     │                 cost_usd, duration_ms → span record
 │     │
 │     └── tokenspy.span("rank_results")
 │
 └── t.score("relevance", 0.92)               ← attach score
LLM calls made inside a span are automatically linked, with no manual wiring.
Code example
import tokenspy

tokenspy.init(persist=True)  # save traces to ~/.tokenspy/usage.db

with tokenspy.trace("research_pipeline", input={"query": "climate change"}) as t:
    with tokenspy.span("retrieve_docs", span_type="retrieval") as s:
        docs = vector_store.search("climate change", top_k=5)
        s.update(output={"n_docs": len(docs), "sources": [d.title for d in docs]})

    with tokenspy.span("summarize", span_type="llm") as s:
        # Any LLM call here is AUTOMATICALLY attributed to this span
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize: {docs}"}]
        )
        s.update(output=response.choices[0].message.content)

    with tokenspy.span("rank_results", span_type="function") as s:
        ranked = rerank(docs, response)
        s.update(output=ranked[:3])

    t.update(output=ranked[:3])

# Attach quality scores after the fact
t.score("relevance", 0.92, scorer="human")
t.score("hallucination", 0.05, scorer="llm_judge", comment="Grounded in sources")
What gets recorded per span
Span: summarize
 ├── span_type: llm
 ├── start_time: 2026-03-10 14:23:01.412
 ├── duration_ms: 842
 ├── model: gpt-4o          ← auto-linked from LLM call
 ├── input_tokens: 4,200    ← auto-linked
 ├── output_tokens: 380     ← auto-linked
 ├── cost_usd: $0.0144      ← auto-linked
 ├── status: ok
 └── output: "Climate change refers to..."
Why tracing matters
Without tracing:
cost report: run_pipeline → $0.052 total, 3 calls
You know the total. You don't know which step took 800ms. You don't know what the retrieval returned. You can't replay it.
With tracing:
trace: research_pipeline       842ms   $0.052
  ├── retrieve_docs             12ms   $0.000    → returned 5 docs
  ├── summarize                810ms   $0.0144   → gpt-4o · 4,200 in · 380 out
  └── rank_results               8ms   $0.000    → [doc3, doc1, doc5]
  scores: relevance=0.92  hallucination=0.05
You see the full picture. You know retrieval was fast but the LLM was slow. You have the inputs and outputs for debugging. You can score the quality.
Nested spans and async
Works with nested spans and async code, with no changes needed:
# Async works identically
async def run():
    async with tokenspy.trace("async_pipeline") as t:
        async with tokenspy.span("step1") as s:
            result = await async_llm_call()
            s.update(output=result)
Feature 3: OpenTelemetry Export
Send tokenspy data to Grafana, Jaeger, Datadog, Honeycomb, or any other OTEL-compatible backend:
tokenspy.init(
    persist=True,
    otel_endpoint="http://localhost:4317",  # your OTLP gRPC endpoint
    otel_service_name="my-llm-app",
)
pip install tokenspy[otel]
Every LLM call is exported as an OpenTelemetry span with standard attributes:
llm.openai.chat
    llm.request.model: "gpt-4o"
    llm.usage.prompt_tokens: 4200
    llm.usage.completion_tokens: 380
    llm.usage.cost_usd: 0.0144
    code.function: "summarize"
What this unlocks:
- Grafana dashboard: cost per minute, P95 latency, error rate
- Jaeger: distributed trace view across microservices
- Datadog: alert when cost per request exceeds threshold
- Any existing OTEL pipeline: tokenspy plugs straight in
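To make the export concrete, this is roughly the span you would get if you built it by hand with the standard opentelemetry-sdk. It is an illustration of the attribute layout above, not tokenspy's internal code; tokenspy emits these spans for you once otel_endpoint is configured.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the standard OTLP exporter at the same endpoint tokenspy would use.
provider = TracerProvider(resource=Resource.create({"service.name": "my-llm-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("tokenspy-example")
with tracer.start_as_current_span("llm.openai.chat") as span:
    # Same attribute names as the listing above.
    span.set_attribute("llm.request.model", "gpt-4o")
    span.set_attribute("llm.usage.prompt_tokens", 4200)
    span.set_attribute("llm.usage.completion_tokens", 380)
    span.set_attribute("llm.usage.cost_usd", 0.0144)
    span.set_attribute("code.function", "summarize")
```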
Feature 4: Evaluations + Datasets
Run your LLM functions against golden test sets and track quality over time, like Braintrust, but local.
import tokenspy
from tokenspy.eval import scorers

tokenspy.init(persist=True)

# 1. Build a dataset
ds = tokenspy.dataset("qa-golden")
ds.add(input={"question": "Capital of France?"}, expected_output="Paris")
ds.add(input={"question": "Capital of Germany?"}, expected_output="Berlin")
ds.from_json("more_test_cases.json")  # bulk import

# 2. Define the function under test
@tokenspy.profile
def answer_question(input: dict) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input["question"]}]
    )
    return response.choices[0].message.content.strip()

# 3. Run the experiment
exp = tokenspy.experiment(
    "gpt4o-mini-baseline",
    dataset="qa-golden",
    fn=answer_question,
    scorers=[scorers.exact_match, scorers.contains],
)
results = exp.run()
results.summary()
Terminal output:
tokenspy · Experiment: gpt4o-mini-baseline
Dataset: qa-golden (2 items)
────────────────────────────────────────────────────────
✓ Capital of France?     exact_match=1.0   contains=1.0   $0.0001   112ms
✓ Capital of Germany?    exact_match=1.0   contains=1.0   $0.0001    98ms
────────────────────────────────────────────────────────
Passed: 2/2 (100.0%)
Cost:   $0.0002
Avg ms: 105
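The built-in scorers (exact_match, contains) are plain callables, so a project-specific check can slot in next to them. The sketch below assumes a custom scorer is any callable that takes the model output and the expected output and returns a float; treat that signature as illustrative, not a documented contract.

```python
# Hypothetical custom scorer: 1.0 when the expected answer appears in the
# model output, ignoring case and surrounding whitespace.
# The (output, expected_output) -> float signature is an assumption.
def fuzzy_contains(output: str, expected_output: str) -> float:
    return 1.0 if expected_output.strip().lower() in output.strip().lower() else 0.0

exp = tokenspy.experiment(
    "gpt4o-mini-fuzzy",
    dataset="qa-golden",
    fn=answer_question,
    scorers=[scorers.exact_match, fuzzy_contains],
)
```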
LLM-as-judge scoring
from tokenspy.eval import scorers

# Scores 0.0-1.0 using a small model as judge
judge = scorers.llm_judge(
    criteria="Is the answer factually accurate and concise?",
    model="gpt-4o-mini",
)

exp = tokenspy.experiment(
    "accuracy-check",
    dataset="qa-golden",
    fn=answer_question,
    scorers=[scorers.exact_match, judge],
)
results = exp.run()
Compare experiments
# After a prompt change, compare against the baseline
results.compare("gpt4o-mini-baseline")
Experiment comparison: gpt4o-mini-v2 vs gpt4o-mini-baseline
────────────────────────────────────────────────
exact_match:   0.95 → 0.80          ▼ 15%
llm_judge:     0.88 → 0.91          ▲  3%
cost:          $0.0002 → $0.0003    ▲ 50%
pass rate:     100% → 80%           ▼ 20%
────────────────────────────────────────────────
Feature 5: Prompt Versioning
Track every version of every prompt. Know exactly which prompt version caused a cost spike or quality drop.
import tokenspy

tokenspy.init(persist=True)

# Push a new version (auto-increments: 1, 2, 3...)
p = tokenspy.prompts.push(
    "summarizer",
    "Summarize the following text in {{style}} style, max {{max_words}} words:\n\n{{text}}"
)
print(p.version) # 1
# Compile with variables
compiled = p.compile(
    style="concise",
    max_words=100,
    text="Long document about climate change..."
)
# → "Summarize the following text in concise style, max 100 words:\n\nLong document..."
# Pull specific version or latest
p_latest = tokenspy.prompts.pull("summarizer")
p_v1 = tokenspy.prompts.pull("summarizer", version=1)
p_prod = tokenspy.prompts.pull("summarizer", label="production")
# Mark a version as production
tokenspy.prompts.set_production("summarizer", version=2)
# List all prompts
tokenspy.prompts.list()
# [{"name": "summarizer", "version": 1, ...},
# {"name": "summarizer", "version": 2, "is_production": True}, ...]
Why this matters: When you run an experiment, you know exactly which prompt version was active. When costs spike, you can diff v1 vs v2 and see what changed.
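Putting it together, a pipeline can pull whichever version carries the production label, compile it, and pass the result to the model. The prompt calls below are the ones documented above; document_text is a placeholder variable for whatever you want summarized.

```python
# Sketch: use the production-labelled prompt version in a real call.
p = tokenspy.prompts.pull("summarizer", label="production")

prompt_text = p.compile(
    style="concise",
    max_words=100,
    text=document_text,  # placeholder: the document to summarize
)

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt_text}],
)
```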
Feature 6: Live Web Dashboard
tokenspy serve
# → http://localhost:7234 (opens automatically)
tokenspy serve --port 8080 --db /path/to/custom.db
pip install tokenspy[server]
The dashboard has 5 tabs:
Overview - cost/day bar chart, top functions by cost, model breakdown donut, live call counter (WebSocket push)
────────────────────────────────────────────────────────────────
 tokenspy dashboard                         ● live · 3 calls/min
──────────────┬─────────────────────────────────────────────────
 Overview     │ Cost per day (last 7 days)
 Traces       │ ▃▃▃▃ ▅▅ ▂▂▂ ▇▇▇▇▇ ▅▅ ▃▃▃▃ ▆▆▆
 Evals        │
 Prompts      │ Top functions          Cost     % of total
 Settings     │ fetch_and_summarize    $0.038   73%  ████████
              │ generate_report        $0.011   21%  ████
              │ extract_entities       $0.003    6%  █
──────────────┴─────────────────────────────────────────────────
Traces - browse every trace, click to expand the full span tree with inputs/outputs/scores
────────────────────────────────────────────────────────────────
 Traces                                             Filter: all
────────────────────────────────────────────────────────────────
 ▼ research_pipeline       842ms   $0.052    2026-03-10 14:23
    ├─ retrieve_docs        12ms   $0.000    retrieval
    ├─ summarize           810ms   $0.0144   llm · gpt-4o
    └─ rank_results          8ms   $0.000    function
    scores: relevance=0.92  hallucination=0.05

 ▶ data_extraction         340ms   $0.021    2026-03-10 14:19
 ▶ report_generation       190ms   $0.009    2026-03-10 14:15
────────────────────────────────────────────────────────────────
Evaluations - run history, pass rates, score distributions per experiment
Prompts - version history table, production flag, preview content
Settings - DB path, OTEL status, version info
Quick Start
Minimal (1 line)
import tokenspy

@tokenspy.profile
def my_function():
    return openai.chat.completions.create(model="gpt-4o", messages=[...])

my_function()
tokenspy.report()
Full v0.2.0 setup
import tokenspy

tokenspy.init(
    persist=True,                            # save everything to ~/.tokenspy/usage.db
    track_git=True,                          # tag calls with git SHA
    otel_endpoint="http://localhost:4317",   # optional: export to Grafana/Jaeger
)

with tokenspy.trace("my_pipeline", input={"query": q}) as t:
    with tokenspy.span("retrieve") as s:
        docs = fetch(q)
        s.update(output=docs)

    with tokenspy.span("generate", span_type="llm") as s:
        answer = llm_call(docs)              # auto-linked to span

    t.update(output=answer)
    t.score("quality", 0.9)

tokenspy.report()
tokenspy serve # open dashboard at http://localhost:7234
CLI
# Call history
tokenspy history
tokenspy history --limit 50
# Reports
tokenspy report
tokenspy report --format html
# Cost diff
tokenspy compare --db before.db --db after.db
tokenspy compare --commit abc123 --commit def456
# GitHub Actions annotations
tokenspy annotate --current current.db --baseline baseline.db
# Live dashboard
tokenspy serve
tokenspy serve --port 8080 --no-open
Budget Alerts
@tokenspy.profile(budget_usd=0.10)
def my_agent(query): ...
# UserWarning: [tokenspy] Budget exceeded in my_agent: $0.1423 > $0.1000
@tokenspy.profile(budget_usd=0.10, on_exceeded="raise")
def strict_agent(query): ...
# raises BudgetExceededError (inherits BaseException, so it propagates through SDK guards)
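If you prefer to degrade gracefully rather than crash, catch the error at the call site. A minimal sketch; the import path for BudgetExceededError is an assumption, so check where your installed version exposes it.

```python
from tokenspy import BudgetExceededError  # import path assumed; may live in a submodule

try:
    result = strict_agent("summarize the quarterly filings")
except BudgetExceededError:
    # fall back to a cheaper model, a cached answer, or a partial result
    result = "Budget cap hit; returning cached summary instead."
```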
LangChain / LangGraph
from tokenspy.integrations.langchain import TokenspyCallbackHandler
chain.invoke(prompt, config={"callbacks": [TokenspyCallbackHandler()]})
# Works with LangGraph agents via the same callback system
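A fuller LCEL example, assuming langchain-openai is installed; the prompt and model here are placeholders, and only the callback wiring is the point.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from tokenspy.integrations.langchain import TokenspyCallbackHandler

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini")

# Every model call inside the chain is reported through the callback handler.
chain.invoke(
    {"text": "Quarterly revenue grew 12% while costs stayed flat..."},
    config={"callbacks": [TokenspyCallbackHandler()]},
)
```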
GitHub Actions: Cost Diff Per PR
from tokenspy.ci import annotate_cost_diff
annotate_cost_diff("current_run.db", "baseline.db")
::warning::fetch_and_summarize cost increased $0.031 (62.4%)
| Function | Cost | vs Baseline |
|---|---|---|
| fetch_and_summarize | $0.0812 | ▲ 62.4% |
| extract_entities | $0.0031 | ▼ 2.1% |
How It Works
tokenspy monkey-patches the SDK client in-process, in the same spirit as profilers like py-spy and line_profiler: everything runs on your machine, inside your own Python process:
Your Code
 │
 ├── tokenspy.trace("pipeline") ──────────────── opens trace context
 │     │
 │     └── tokenspy.span("step") ─────────────── opens span context
 │           │
 │           └── openai.chat.completions.create(...)
 │                 │
 │                 └── tokenspy interceptor (monkey-patch)
 │                       ├── calls original SDK method
 │                       ├── reads response.usage
 │                       ├── looks up cost in pricing table
 │                       ├── records CallRecord in Tracker
 │                       ├── auto-links to active span   ← NEW
 │                       └── returns response unchanged

tokenspy.report()  → flame graph
tokenspy serve     → web dashboard
No proxy server. No HTTP interception. No environment variables. Your code runs exactly as before.
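For intuition, here is a minimal sketch of that wrap-and-record pattern (not tokenspy's actual source), written against the openai>=1.0 client; exact module paths can shift between SDK versions.

```python
import time
from openai.resources.chat import completions

_original_create = completions.Completions.create

def _instrumented_create(self, *args, **kwargs):
    start = time.perf_counter()
    response = _original_create(self, *args, **kwargs)   # call the real SDK method
    if response.usage is not None:                       # streaming responses need extra handling
        record = {
            "model": kwargs.get("model"),
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "duration_ms": (time.perf_counter() - start) * 1000,
        }
        print(record)  # a real tracker would price this and attach it to the active span
    return response                                      # caller sees the response unchanged

completions.Completions.create = _instrumented_create
```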
vs. Langfuse and Braintrust
| | Langfuse | Braintrust | tokenspy |
|---|---|---|---|
| Requires proxy / cloud | ❌ cloud | ❌ cloud | ✅ fully local |
| Requires signup | ❌ yes | ❌ yes | ✅ no |
| Data leaves your machine | ❌ yes | ❌ yes | ✅ never |
| Works offline | ❌ no | ❌ no | ✅ yes |
| Zero dependencies (core) | ❌ no | ❌ no | ✅ yes |
| Structured tracing | ✅ yes | ✅ yes | ✅ yes |
| Evaluations + datasets | ✅ yes | ✅ yes | ✅ yes |
| LLM-as-judge scoring | ✅ yes | ✅ yes | ✅ yes |
| Prompt versioning | ✅ yes | ✅ yes | ✅ yes |
| OpenTelemetry export | ⚡ partial | ❌ no | ✅ yes |
| Flame graph by function | ❌ no | ❌ no | ✅ yes |
| @decorator API | ❌ no | ❌ no | ✅ yes |
| Budget alerts | ⚡ partial | ⚡ partial | ✅ yes |
| Git commit cost tracking | ❌ no | ❌ no | ✅ yes |
| GitHub Actions cost diff | ❌ no | ❌ no | ✅ yes |
| Optimization hints | ❌ no | ❌ no | ✅ yes |
| Monthly cost | $0-$250 | $0-$300 | free forever |
tokenspy's unique advantages:
- No proxy: intercepts in-process, zero latency overhead, works with any network config
- @decorator API: profile any function with one line, no SDK changes
- Flame graph: visual cost breakdown by function, not just by model
- Git tracking: tag every call with commit SHA, compare costs across code versions
- PR cost diffs: catch cost regressions in CI before they ship
Supported Providers
| Provider | Package | Auto-detected |
|---|---|---|
| OpenAI | openai>=1.0 | chat.completions.create (sync + async + streaming) |
| Anthropic | anthropic>=0.30 | messages.create (sync + async + streaming) |
| Google Gemini | google-generativeai>=0.7 | generate_content |
| LangChain | langchain-core>=0.2 | Callback handler (any model/provider) |
Install Options
pip install tokenspy # zero dependencies โ core profiling only
pip install tokenspy[openai] # + openai SDK
pip install tokenspy[anthropic] # + anthropic SDK
pip install tokenspy[langchain] # + langchain-core
pip install tokenspy[otel] # + OpenTelemetry export
pip install tokenspy[server] # + web dashboard (fastapi + uvicorn)
pip install tokenspy[all] # openai + anthropic + langchain
Built-in Pricing Table
30+ models, updated March 2026. No network calls.
| Model | Input $/1M | Output $/1M |
|---|---|---|
| claude-opus-4-6 | $15.00 | $75.00 |
| claude-sonnet-4-6 | $3.00 | $15.00 |
| claude-haiku-4-5 | $0.80 | $4.00 |
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| o1 | $15.00 | $60.00 |
| gemini-1.5-pro | $1.25 | $5.00 |
| gemini-1.5-flash | $0.075 | $0.30 |
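The table is applied directly to the token counts reported by the provider. As a sanity check, here is the arithmetic for the summarize span shown earlier, using two rows from the table (prices are per 1M tokens):

```python
# Per-1M-token prices copied from the rows above (illustrative subset).
PRICES = {
    "gpt-4o":      (2.50, 10.00),   # (input $/1M, output $/1M)
    "gpt-4o-mini": (0.15, 0.60),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return input_tokens / 1_000_000 * inp + output_tokens / 1_000_000 * out

cost_usd("gpt-4o", 4_200, 380)   # ≈ 0.014, in line with the summarize span shown earlier
```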
Contributing
git clone https://github.com/pinakimishra95/tokenspy
cd tokenspy
pip install -e ".[dev]"
pytest tests/ # 139 tests, ~0.3s
Issues and PRs welcome โ especially for new provider support and updated pricing.
License
MIT © Pinaki Mishra. See LICENSE.
Project details
File details
Details for the file tokenspy-0.2.0.tar.gz.
File metadata
- Download URL: tokenspy-0.2.0.tar.gz
- Upload date:
- Size: 807.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 57c8da6658673b9b5f5c889c7effa590e7fe54bb7de725267af755685e016880 |
| MD5 | 64b1397a4fee90bccad32458e1f91252 |
| BLAKE2b-256 | 8ed2b15f50a85441e0a788436849360595fe97edf655ef25a852546de166d770 |
File details
Details for the file tokenspy-0.2.0-py3-none-any.whl.
File metadata
- Download URL: tokenspy-0.2.0-py3-none-any.whl
- Upload date:
- Size: 60.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3c07a2edcb2494eca55229005da94a078d918e33d450e816a7b085b4b5e0356 |
| MD5 | 1ef6b57b50091c3dc1b1d2eb348e1918 |
| BLAKE2b-256 | ec3ef8b50eab7f02e3ed43ada3b18fceaf9063688c6840ac48f3e09b68806c49 |