OpenAI-compatible LLM proxy with SQLite request capture, observability, and an admin UI.

LLM Observe Proxy

llm-observe-proxy is an OpenAI-compatible, record-only-by-default proxy for inspecting LLM traffic. It forwards requests to an upstream /v1 API, stores requests and responses in SQLite, and provides a local admin UI for browsing and pretty-printing captured traffic, trimming old records, grouping requests into task runs, and changing runtime settings.

It is useful when you want LiteLLM-style observability without introducing a full gateway or external database.

Project repository: https://github.com/shamitv/llm-observe-proxy

Current release: 0.3.0, with editable model pricing, cost snapshots, run what-if comparisons, and a refined run detail summary layout.

Features

  • OpenAI-compatible pass-through route: ANY /v1/{path:path}.
  • SQLite capture for request/response headers, bodies, status, timing, model, endpoint, streaming state, tool-call signals, image assets, provider token usage, cost snapshots, and errors.
  • Admin UI for searching and browsing captured traffic, including per-request output TPS and estimated cost.
  • Runs for grouping all requests made during a task, benchmark, or repro workflow.
  • Run detail pages with request counts, LLM wall time, token totals, cost totals, tokens/sec, model and endpoint breakdowns, and signal/error counts.
  • Run what-if pricing for comparing captured usage against other configured model prices.
  • Detail pages with response render modes for JSON, plain text, Markdown, tool calls, and raw SSE streams.
  • Request image gallery for data URL and remote image references.
  • Settings UI for upstream URL, model upstream routes, model provider/pricing config, response compatibility fixes, incoming host/port preferences, all-IPs exposure, and retention trimming.
  • Config-driven model routes for sending selected proxy-facing model names to different upstream /v1 endpoints with optional upstream model rewrites, provider selection, and API key injection.
  • Opt-in response compatibility fixes for known upstream quirks, with raw upstream response audit storage when a rewrite or warning occurs.
  • No authentication by default, intended for local or trusted development networks.

Install

From PyPI with pip:

python -m pip install llm-observe-proxy
llm-observe-proxy

From PyPI with uv:

uv tool install llm-observe-proxy
llm-observe-proxy

Run it once without installing:

uvx llm-observe-proxy

By default, the proxy listens on:

http://localhost:8080

and forwards requests to:

http://localhost:8000/v1

Open the admin UI:

http://localhost:8080/admin

Usage

Point an OpenAI-compatible client at the proxy:

from openai import OpenAI

client = OpenAI(
    api_key="local-dev-key",
    base_url="http://localhost:8080/v1",
)

response = client.chat.completions.create(
    model="gpt-demo",
    messages=[{"role": "user", "content": "Hello through the proxy"}],
)
print(response.choices[0].message.content)

Run on a different port:

llm-observe-proxy --port 8090

Expose on all interfaces:

llm-observe-proxy --expose-all-ips

Set the upstream from the CLI:

llm-observe-proxy --upstream-url http://localhost:8000/v1

Load model-specific upstream routes from a JSON file:

llm-observe-proxy --models-file .\models.json

You can also change the upstream URL, model upstream routes, response compatibility fixes, model provider pricing, and next-start incoming host/port settings from /admin/settings.

Model Routes

Model routes let one proxy endpoint send different client-facing models to different OpenAI-compatible upstreams. Routes match the top-level model field of the JSON request payload exactly. Unknown models, requests without a JSON model field, and generic calls such as GET /v1/models use the global upstream fallback.

Example route file:

[
  {
    "model": "local-qwen",
    "upstream_url": "http://localhost:8000/v1",
    "upstream_model": "qwen3-coder-30b"
  },
  {
    "model": "openai-mini",
    "upstream_url": "https://api.openai.com/v1",
    "upstream_model": "gpt-4.1-mini",
    "provider_slug": "openai",
    "api_key_env": "OPENAI_API_KEY"
  }
]
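
The matching rule described above can be sketched in a few lines of Python. This is an illustration only, not the proxy's actual code: resolve_route, ROUTES, and DEFAULT_UPSTREAM are hypothetical names, and the sketch assumes only the route keys shown in the example file.

```python
import json

# Hypothetical global fallback, mirroring the documented default upstream.
DEFAULT_UPSTREAM = "http://localhost:8000/v1"

# Routes in the same shape as the example file (API key handling omitted).
ROUTES = [
    {
        "model": "local-qwen",
        "upstream_url": "http://localhost:8000/v1",
        "upstream_model": "qwen3-coder-30b",
    },
    {
        "model": "openai-mini",
        "upstream_url": "https://api.openai.com/v1",
        "upstream_model": "gpt-4.1-mini",
    },
]


def resolve_route(body, routes=ROUTES, default_upstream=DEFAULT_UPSTREAM):
    """Return (upstream_url, model_to_send) for a raw request body."""
    try:
        payload = json.loads(body) if body else None
    except ValueError:
        payload = None  # non-JSON bodies use the global fallback
    model = payload.get("model") if isinstance(payload, dict) else None
    for route in routes:
        # Exact, case-sensitive match on the top-level model field.
        if isinstance(model, str) and route["model"] == model:
            return route["upstream_url"], route.get("upstream_model", model)
    return default_upstream, model
```

With these routes, a body of {"model": "local-qwen"} resolves to the local upstream with the model rewritten to qwen3-coder-30b, while an empty body (as in GET /v1/models) falls back to the default upstream.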

The same file can use an object form when you want default-upstream fixes as well as route-specific fixes:

{
  "default_fixes": [],
  "model_routes": [
    {
      "model": "local-qwen",
      "upstream_url": "http://localhost:8000/v1",
      "upstream_model": "qwen3-coder-30b",
      "fixes": ["qwen-tagged-tool-call-rewrite"]
    }
  ]
}
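
Because the file accepts both an array and an object, a loader has to normalize the two shapes before use. The following is a minimal sketch of that idea, assuming only the default_fixes and model_routes keys shown above; normalize_model_config is a hypothetical helper, not part of the package's API.

```python
import json


def normalize_model_config(raw):
    """Accept either the bare array form or the object form of the model config."""
    data = json.loads(raw)
    if isinstance(data, list):
        # Array form: routes only, so there are no default-upstream fixes.
        return {"default_fixes": [], "model_routes": data}
    if isinstance(data, dict):
        # Object form: both keys are treated as optional in this sketch.
        return {
            "default_fixes": list(data.get("default_fixes", [])),
            "model_routes": list(data.get("model_routes", [])),
        }
    raise ValueError("model config must be a JSON array or object")
```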

Run with the file:

$env:OPENAI_API_KEY = "sk-..."
llm-observe-proxy --models-file .\models.json

You can also set LLM_OBSERVE_MODELS_JSON to the same JSON, in either the array form or the object form. If both LLM_OBSERVE_MODELS_FILE and LLM_OBSERVE_MODELS_JSON are set, the file wins.
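
The precedence between the two variables can be pictured as follows. This is a sketch under the assumption that the file path is read as UTF-8 text; read_models_config_text is a hypothetical name, and the real loader may differ in details such as error handling.

```python
import os
from pathlib import Path


def read_models_config_text(env=None):
    """Return the raw JSON text for model routes, or None when neither variable is set."""
    env = os.environ if env is None else env
    file_path = env.get("LLM_OBSERVE_MODELS_FILE")
    if file_path:
        # The file takes precedence over inline JSON.
        return Path(file_path).read_text(encoding="utf-8")
    return env.get("LLM_OBSERVE_MODELS_JSON") or None
```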

You can add, update, and delete UI-managed model routes from /admin/settings. UI-managed routes are stored in SQLite and take effect immediately. Routes loaded from --models-file, LLM_OBSERVE_MODELS_FILE, or LLM_OBSERVE_MODELS_JSON remain read-only in the UI, and duplicate model names are rejected.

When a route has an API key, the proxy injects Authorization: Bearer <key> for the upstream request. Captured request headers remain the original client headers; injected keys are not stored or shown in the admin UI. UI-managed routes store only api_key_env, which names an environment variable rather than holding the key itself. Prefer api_key_env in shared config files too, so that secrets stay out of the file.
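
A rough sketch of the injection step, assuming the key is looked up through api_key_env and that any client-supplied Authorization header is replaced on the upstream copy only. build_upstream_headers is a hypothetical helper used for illustration.

```python
import os


def build_upstream_headers(client_headers, route, env=None):
    """Return a copy of the client headers with the route's key injected, if any."""
    env = os.environ if env is None else env
    headers = dict(client_headers)  # copy: the captured client headers stay untouched
    env_name = route.get("api_key_env")
    api_key = env.get(env_name) if env_name else None
    if api_key:
        # Assumption: drop any client Authorization header, whatever its casing.
        headers = {k: v for k, v in headers.items() if k.lower() != "authorization"}
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```

Keeping the captured headers and the upstream headers as separate dictionaries is what lets the stored request stay free of injected secrets.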

Response Compatibility Fixes

Compatibility fixes are ordered, opt-in response transformations for known model/provider quirks. The first built-in fix is qwen-tagged-tool-call-rewrite, which can promote a complete Qwen-style <tool_call> block from Chat Completions reasoning_content or reasoning into structured OpenAI-compatible tool_calls.

The Qwen fix runs only on /v1/chat/completions when the request declares tools. It does not execute tools. Malformed or ambiguous blocks pass through unchanged and are recorded as warnings. When a fix rewrites or warns, the request detail page stores and shows both the client-visible response and the raw upstream response.
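
To make the behavior concrete, here is a simplified sketch of such a rewrite. It is not the built-in fix: it assumes the tagged block contains a JSON object with name and arguments keys, and the function name, warning strings, and generated call id are all hypothetical.

```python
import json
import re
import uuid

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def promote_tagged_tool_call(message, request_declares_tools):
    """Return (message, warning). Only one complete, well-formed block is promoted."""
    if not request_declares_tools or message.get("tool_calls"):
        return message, None  # the fix does not apply
    text = message.get("reasoning_content") or message.get("reasoning") or ""
    if "<tool_call>" not in text:
        return message, None
    blocks = TOOL_CALL_RE.findall(text)
    if len(blocks) != 1 or text.count("<tool_call>") != 1:
        return message, "ambiguous or incomplete <tool_call> block"
    try:
        call = json.loads(blocks[0])
        name = call["name"]
        arguments = call.get("arguments", {})
    except (ValueError, KeyError):
        return message, "malformed <tool_call> payload"
    if not isinstance(name, str):
        return message, "malformed <tool_call> payload"
    rewritten = dict(message)  # leave the raw upstream message unmodified
    rewritten["tool_calls"] = [{
        "id": "call_" + uuid.uuid4().hex[:12],
        "type": "function",
        # OpenAI-style tool calls carry arguments as a JSON string.
        "function": {"name": name, "arguments": json.dumps(arguments)},
    }]
    return rewritten, None
```

The sketch only restructures the response. It never executes the named tool, and it returns the original message object whenever it declines to rewrite.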

Configure fixes from /admin/settings, per model route, or with environment variables:

$env:LLM_OBSERVE_DEFAULT_FIXES_JSON = '["qwen-tagged-tool-call-rewrite"]'

Cost Estimates

Cost estimates are snapshotted when a response is captured. The proxy stores the billing provider, billing model, token counts, input/output rate snapshot, and estimated USD cost on the request row. Historical rows are not recalculated when pricing changes.

Token counts are extracted from OpenAI-compatible usage objects, including the shapes used by OpenAI, vLLM, SGLang, and LM Studio. When standard usage is absent, the proxy can also read llama.cpp timings and Ollama-style prompt_eval_count / eval_count fields if those metrics are present in captured /v1 responses or stream events.
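
The fallback order can be sketched as below. The usage field names follow the OpenAI Chat Completions and Responses shapes; the llama.cpp keys prompt_n and predicted_n are an assumption about its timings object, and extract_token_counts is a hypothetical helper rather than the proxy's own function.

```python
def extract_token_counts(body):
    """Return (input_tokens, output_tokens), or None when no known shape is present."""
    usage = body.get("usage")
    if isinstance(usage, dict):
        # Chat Completions names first, then Responses-style names.
        inp = usage.get("prompt_tokens", usage.get("input_tokens"))
        out = usage.get("completion_tokens", usage.get("output_tokens"))
        if inp is not None and out is not None:
            return int(inp), int(out)
    timings = body.get("timings")
    if isinstance(timings, dict) and "prompt_n" in timings and "predicted_n" in timings:
        # Assumed llama.cpp timings keys.
        return int(timings["prompt_n"]), int(timings["predicted_n"])
    if "prompt_eval_count" in body and "eval_count" in body:
        # Ollama-style counters.
        return int(body["prompt_eval_count"]), int(body["eval_count"])
    return None
```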

The estimator uses separate input and output token rates per 1M tokens:

cost = (input_tokens * input_rate + output_tokens * output_rate) / 1,000,000
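
As a worked example of the formula, the sketch below uses Decimal to avoid floating-point drift. The rates are made-up illustrative values, not real prices, and estimate_cost_usd is a hypothetical name.

```python
from decimal import Decimal


def estimate_cost_usd(input_tokens, output_tokens, input_rate, output_rate):
    """Apply the formula above; rates are USD per 1M tokens."""
    total = (Decimal(input_tokens) * Decimal(str(input_rate))
             + Decimal(output_tokens) * Decimal(str(output_rate)))
    return total / Decimal(1_000_000)


# Illustrative rates: 2.00 USD per 1M input tokens, 8.00 USD per 1M output tokens.
# (1200 * 2.00 + 350 * 8.00) / 1,000,000 = 5200 / 1,000,000 = 0.0052 USD
cost = estimate_cost_usd(1200, 350, input_rate="2.00", output_rate="8.00")
```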

Billing identity is resolved from the routed upstream model when a model route rewrites the request, otherwise from the upstream response model when present, otherwise from the client request model. Provider identity comes from a route's optional provider_slug, then falls back to a provider whose configured upstream URL exactly matches the active upstream base.
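
The two fallback chains can be written down as small functions. This is a sketch of the stated precedence only; the function names and the providers mapping (provider slug to configured upstream URL) are hypothetical.

```python
def resolve_billing_model(route_upstream_model, response_model, request_model):
    """Pick the billing model in the documented order of precedence."""
    for candidate in (route_upstream_model, response_model, request_model):
        if candidate:
            return candidate
    return None


def resolve_provider(route_provider_slug, active_upstream_base, providers):
    """providers maps a provider slug to its configured upstream URL."""
    if route_provider_slug:
        return route_provider_slug
    for slug, url in providers.items():
        if url == active_upstream_base:  # exact string match, no normalization
            return slug
    return None
```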

SQLite is seeded with editable standard paid text rates for common OpenAI, Anthropic, and Google Gemini models. Those seed values were checked against official pricing pages on May 3, 2026. They are inserted only when missing, so UI edits are preserved. This first version of the cost estimate intentionally ignores cache pricing, batch/flex/priority tiers, tool fees, image/audio pricing, regional premiums, and long-context premiums.

Run detail pages include what-if cost comparisons. By default they compare captured run usage against GPT-5.5 and GPT-5.4 Mini when those prices are active. You can select any other active model price from the run page or link directly with repeated query parameters:

/admin/runs/1?what_if=openai:gpt-5.5&what_if=openai:gpt-5.4-mini
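
A link like this can be built with the standard library. The sketch assumes the provider:model selection format seen in the example URL and the default local address; what_if_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def what_if_url(run_id, selections, base="http://localhost:8080"):
    """Build a run detail link with one what_if parameter per selection."""
    # safe=":" keeps the provider:model separator readable instead of %3A.
    query = urlencode([("what_if", s) for s in selections], safe=":")
    path = f"{base}/admin/runs/{run_id}"
    return f"{path}?{query}" if query else path
```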

What-if comparisons use stored request token counts and do not change captured request cost snapshots.

Runs

Use Runs when you want to measure or review LLM usage for one bounded task, such as processing a video, comparing local and cloud models, or reproducing an agent issue.

  1. Open /admin/runs or use the run control on /admin.
  2. Enter a required run name and choose Start run.
  3. Run your application or benchmark through the proxy.
  4. Choose End run when the task is complete.

Starting a new run automatically ends any existing active run. Requests made while a run is active are linked to that run; requests outside a run are still captured normally.

The request browser can filter by run, and request rows link back to their run. The run detail page reports LLM wall time from the first request start to the last response completion, plus token totals, cost totals, and tokens/sec metrics. The request table's TPS column shows per-request output tokens per second when token usage and duration are available. Run-level Output tok/s uses output tokens divided by summed request duration, matching the total request duration shown on the page.
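
The difference between the per-request and run-level figures is easiest to see with numbers. The sketch below assumes durations in seconds and skips requests that lack token usage or duration; both function names are hypothetical.

```python
def request_output_tps(output_tokens, duration_seconds):
    """Per-request output tokens per second, or None when data is missing."""
    if not output_tokens or not duration_seconds:
        return None
    return output_tokens / duration_seconds


def run_output_tps(requests):
    """Run-level output tok/s: total output tokens over summed request duration."""
    usable = [(t, d) for t, d in requests if t and d]
    total_duration = sum(d for _, d in usable)
    if not total_duration:
        return None
    return sum(t for t, _ in usable) / total_duration


# Two requests: 100 tokens in 2 s (50 tok/s) and 300 tokens in 3 s (100 tok/s).
run_tps = run_output_tps([(100, 2.0), (300, 3.0)])  # 400 tokens / 5 s
```

Here the run-level value is 80 tok/s, which is neither the average of the per-request rates (75) nor a rate over wall time, since overlapping or idle periods do not enter the summed duration.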

Screenshots and the full developer README are available in the project repository: https://github.com/shamitv/llm-observe-proxy

Routes

  • ANY /v1/{path:path}: OpenAI-compatible pass-through proxy.
  • GET /admin: request browser.
  • GET /admin/requests/{id}: request/response detail view.
  • GET /admin/runs: run browser and active run controls.
  • GET /admin/runs/{id}: run metrics, what-if cost comparison, and associated request list.
  • POST /admin/runs/start: start a named run, ending any active run first.
  • POST /admin/runs/end: end the active run.
  • GET /admin/settings: upstream settings and retention tools.
  • POST /admin/settings/incoming: update incoming host/port settings for next startup.
  • POST /admin/settings/upstream: update upstream URL.
  • POST /admin/settings/compat-fixes: update default-upstream compatibility fixes.
  • POST /admin/settings/model-routes: create or update a UI-managed model route.
  • POST /admin/settings/model-routes/delete: delete a UI-managed model route.
  • POST /admin/settings/providers: create or update a model provider.
  • POST /admin/settings/providers/delete: delete a model provider and its prices.
  • POST /admin/settings/model-prices: create or update model token pricing.
  • POST /admin/settings/model-prices/delete: delete model token pricing.
  • POST /admin/trim: delete records older than N days.
  • GET /healthz: health check.

Configuration

Environment variables:

Variable | Default | Purpose
LLM_OBSERVE_DATABASE_URL | sqlite:///./llm_observe_proxy.sqlite3 | SQLite SQLAlchemy URL.
LLM_OBSERVE_INCOMING_HOST | localhost | Bind host when not exposing all IPs.
LLM_OBSERVE_INCOMING_PORT | 8080 | Bind port.
LLM_OBSERVE_EXPOSE_ALL_IPS | false | Bind to 0.0.0.0 when true.
LLM_OBSERVE_UPSTREAM_URL | http://localhost:8000/v1 | Upstream OpenAI-compatible /v1 base URL.
LLM_OBSERVE_MODELS_JSON | unset | JSON array of model route objects, or an object with default_fixes and model_routes.
LLM_OBSERVE_MODELS_FILE | unset | Path to a JSON file containing model routes or model config. Wins over LLM_OBSERVE_MODELS_JSON.
LLM_OBSERVE_DEFAULT_FIXES_JSON | unset | JSON array of default-upstream compatibility fix IDs when no model config object supplies default_fixes.
LLM_OBSERVE_LOG_LEVEL | INFO | Uvicorn log level.

Incoming host/port settings saved in the UI are used on the next process startup; they do not rebind a currently running process.
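
For reference, the core variables and their defaults can be mirrored in a small settings object. This is a sketch, not the package's own settings class: ProxySettings and load_settings are hypothetical names, and the set of accepted truthy spellings is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxySettings:
    database_url: str
    incoming_host: str
    incoming_port: int
    expose_all_ips: bool
    upstream_url: str
    log_level: str

    @property
    def bind_host(self):
        # Exposing all IPs overrides the configured host.
        return "0.0.0.0" if self.expose_all_ips else self.incoming_host


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    expose = env.get("LLM_OBSERVE_EXPOSE_ALL_IPS", "false").strip().lower()
    return ProxySettings(
        database_url=env.get("LLM_OBSERVE_DATABASE_URL", "sqlite:///./llm_observe_proxy.sqlite3"),
        incoming_host=env.get("LLM_OBSERVE_INCOMING_HOST", "localhost"),
        incoming_port=int(env.get("LLM_OBSERVE_INCOMING_PORT", "8080")),
        expose_all_ips=expose in {"1", "true", "yes", "on"},  # assumed spellings
        upstream_url=env.get("LLM_OBSERVE_UPSTREAM_URL", "http://localhost:8000/v1"),
        log_level=env.get("LLM_OBSERVE_LOG_LEVEL", "INFO"),
    )
```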

Tests

.\.venv\Scripts\ruff.exe check src tests
.\.venv\Scripts\python.exe -m compileall -q src tests
.\.venv\Scripts\pytest.exe -q

The test suite starts a fake upstream on localhost:8080/v1, so stop any local process using port 8080 before running tests.

Publishing

See the repository publishing guide for name checks, build commands, and the pre-publish checklist.

License

MIT.
