Skip to main content

pprof for your LLM context - see where every token and dollar goes.

Project description

llmprof

pprof for your LLM context. See where every token and dollar goes.

PyPI npm CI Python 3.9+ MIT Docs Live demo 100% local

Try the live dashboard in your browser - no install, a real recorded session.

llmprof: point your base_url at the proxy and see a flame graph of where every token went, plus the dollars you can reclaim

lossless animation; a GIF version is also available · open the interactive demo

You profile CPU and memory. Why fly blind on the most expensive resource in your AI app, the context window? Your billing page is a meter - it says how much you spent. llmprof is a profiler - it says where each request's tokens went (system prompt vs. tool schemas vs. RAG vs. history), prices the call, flame-graphs it, and tells you what to cut.

pipx install llmprof && llmprof up      # or, no Python:  npx llmprof up

Point your client's base URL at http://localhost:4000/v1 (your API key passes straight through) and open http://localhost:4000.

Private by design. llmprof is fully self-hosted: it runs on your machine (or your own server), and your prompts, completions, and API keys are only ever sent to the upstream provider you already use. Nothing is sent to llmprof, a third party, or any cloud. The trace database is a local file you own. Safe to run against production traffic and client data with no new data-sharing concerns.

What you see

A flame graph of one request's tokens, with the optimization findings and the dollars you can reclaim on the call:

Context flame graph with per-tool drill-down, optimization findings, and a reclaimable-cost strip

The headline number across all your calls, projected to a month, plus day-over-day trends and a most-expensive-prompts leaderboard:

Trends view with a reclaimable-per-month banner, today vs yesterday cards, a cost-per-day chart, and a by-model breakdown

Context creep across an agent's turns - history balloons while the system prompt and tools stay flat:

Context timeline showing prompt tokens per turn growing across a run

Quickstart

from openai import OpenAI
client = OpenAI(base_url="http://localhost:4000/v1")  # the only change
client.chat.completions.create(model="gpt-4o", messages=[...], tools=[...])

One proxy profiles both providers (and Codex + Claude Code) at once - for Anthropic just set the base URL (no /v1):

from anthropic import Anthropic
client = Anthropic(base_url="http://localhost:4000")

Then open the dashboard, or llmprof traces for a terminal summary. Full docs: https://luthrag.github.io/llmprof.

Features

  • Context flame graph - per-request token breakdown with per-tool drill-down.
  • Waste detector - duplicated content, unused tool schemas, and uncached prefixes, rolled into a "$X/mo reclaimable" headline.
  • Context timeline - how context grows turn over turn across an agent run.
  • Cost leaderboard - which prompt template (system prompt + tools) drives the bill, not just which model.
  • Cost for 1000+ models from a bundled LiteLLM snapshot (offline, no fetch), with curated rates for the newest flagships and LLMPROF_PRICING overrides.
  • Runs local, single SQLite file, with a pluggable backend for a shared database.

Works with

  • Any OpenAI-compatible API via /v1/chat/completions and /v1/responses. Defaults to OpenAI; to use another (Azure, Groq, Together, OpenRouter, DeepSeek, Fireworks, Gemini's OpenAI endpoint, local Ollama / vLLM) set --upstream.
  • Anthropic via /v1/messages (auto-routed, no flag needed).
  • Claude Code and the Codex CLI - set their base URL to the proxy.
  • Any language - the proxy is a local HTTP service; only the base URL changes.

SDKs

When the proxy's heuristics are not enough, label components yourself for precise attribution:

# Python
import llmprof
with llmprof.profile(model="gpt-4o") as p:
    p.add("system prompt", system_text)
    p.add("rag_chunk", doc, name="kb#42")
    p.add("tool", search_schema, name="search", called=True)
    p.usage(resp.usage)
// JavaScript / TypeScript  (npm i @llmprof/sdk)
import { profile } from "@llmprof/sdk";
await profile({ model: "gpt-4o" }, async (p) => {
  p.add("system prompt", systemText);
  p.add("rag_chunk", doc, { name: "kb#42" });
  p.add("tool", searchSchema, { name: "search", called: true });
  p.usage(resp.usage);
});

How it works

The proxy forwards your request unchanged and streams the response straight back; the analysis (tokenizing, attribution, pricing, waste detection) happens off the hot path, so it adds essentially no latency. See the architecture docs for the full picture.

Configuration

What Flag Env var Default
Bind host --host LLMPROF_HOST 127.0.0.1
Bind port --port LLMPROF_PORT 4000
Upstream API --upstream LLMPROF_UPSTREAM OpenAI
Price overrides LLMPROF_PRICING built-in table
Data dir LLMPROF_HOME ~/.llmprof
Storage backend LLMPROF_DB_URL SQLite (local file)

What llmprof is not

Not a full observability platform (no eval suite, prompt management, or hosted cloud, that is Langfuse / Phoenix). llmprof is the focused profiler: where your tokens go, and what to cut.

Develop

python -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"
ruff check . && pytest

The dashboard is dependency-light vanilla JS/SVG; docs live in docs/ (Astro Starlight). See Contributing and the changelog. Runnable examples cover both providers and both SDKs.

License

MIT (c) Gaurav Luthra

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmprof-0.1.3.tar.gz (94.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llmprof-0.1.3-py3-none-any.whl (70.6 kB view details)

Uploaded Python 3

File details

Details for the file llmprof-0.1.3.tar.gz.

File metadata

  • Download URL: llmprof-0.1.3.tar.gz
  • Upload date:
  • Size: 94.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for llmprof-0.1.3.tar.gz
Algorithm Hash digest
SHA256 5b12f8f6dd509020b7056d6c55e9749dcde65b68966c2f173e46b96e69165aba
MD5 ffdb532e12d2a076c016f8627dd8ed14
BLAKE2b-256 a38e9f38f7cb103b3d49c236c002a1af801f62a831e26ba7e313d13c7629f86f

See more details on using hashes here.

File details

Details for the file llmprof-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: llmprof-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 70.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for llmprof-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 89ff4d1fb10d7026e85f05e4f012aeb78fccb8e2a53f669004e6c32f3c7bdde7
MD5 15196c80c9179c0237a55b7c9bdc6f48
BLAKE2b-256 916402dc135d3b0fd92d18c728238011578f6b9c46b9382af912ddb9ac1925ad

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page