Cost-vs-accuracy CI for LLM ops. Pick the cheapest API tier, compare self-hosted vLLM vs cloud APIs on one Pareto, and grade open-ended outputs with an LLM-as-judge scorer — all on your own data with Wilson 95% CIs.

depth-lens

Pick the cheapest LLM config that meets your accuracy bar — for your data, not somebody else's benchmark.

Sweep every (model, knob) on your prompts in one CLI call. Wilson 95% CIs, per-call cost, latency p50, cross-vendor. Roughly $1 and ten minutes per audit.

Japanese version

License: MIT · Python 3.11+ · Status: v2.1

[Figure: Switching from Opus 4.7 to Haiku 4.5 saves $123k/year on a 10k-calls/day workload at the same accuracy]

The plot above is a real depth-lens recommend output. Four Anthropic configurations all score 1.00 accuracy on a K-hop tier-4 prompt set — and span ~35× in cost. That's the gap the "use the latest / biggest" instinct burns through silently. depth-lens finds the cheapest passing tier on your prompts in under 10 minutes. The 30-second install is right below.

30 seconds: install, run, decide

pip install "depth-lens[openai]"            # quotes keep zsh from globbing the brackets; add ,anthropic / ,gemini as needed
export OPENAI_API_KEY=...

# Use the bundled example bench (5 modular-arithmetic prompts), or write your own:
python -c "from depth_lens.data import copy_example; copy_example('modular_arithmetic.jsonl')"

depth-lens recommend \
    --models openai:gpt-5-mini,openai:o4-mini \
    --task custom:modular_arithmetic.jsonl:first_int \
    --target-accuracy 0.95 \
    --max-latency 3.0 \
    --n-samples 16 \
    --daily-calls 10000

To bring your own prompts, swap modular_arithmetic.jsonl for any JSONL of {"prompt": ..., "target": ..., "depth": ...} rows.
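As a minimal sketch of what such a file could look like, the snippet below writes two rows in that shape. The example prompts, the file name my_bench.jsonl, and the choice to store target as a string are assumptions for illustration; only the three keys come from the line above.

```python
import json
from pathlib import Path

# Hypothetical rows: only the keys "prompt", "target", "depth" come from the README.
rows = [
    {"prompt": "What is (7 + 5) mod 4? Answer with a single integer.",
     "target": "0", "depth": 1},
    {"prompt": "What is ((7 + 5) * 3) mod 4? Answer with a single integer.",
     "target": "0", "depth": 2},
]


def write_bench(path, rows):
    """Write one JSON object per line (the JSONL convention)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII prompts (e.g. Japanese) readable.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_bench("my_bench.jsonl", rows)
```

The resulting path would then take the place of modular_arithmetic.jsonl in the --task argument.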

============================================================================================
Target accuracy ≥ 0.95  ·  Max latency ≤ 3.00s/pred
Probed 6 configurations, 6 passing.
============================================================================================

✅ Passing (cheapest first):
  openai:gpt-5-mini     d=1  effort=low      acc=1.00   $0.354/k-pred   0.45s/pred  ← cheapest
  openai:gpt-5-mini     d=1  effort=medium   acc=1.00   $0.485/k-pred   0.59s/pred
  openai:o4-mini        d=1  effort=low      acc=1.00   $0.736/k-pred   0.29s/pred  ← fastest
  openai:gpt-5-mini     d=1  effort=high     acc=1.00   $0.886/k-pred   0.69s/pred
  openai:o4-mini        d=1  effort=medium   acc=1.00   $1.061/k-pred   0.37s/pred
  openai:o4-mini        d=1  effort=high     acc=1.00   $1.365/k-pred   0.40s/pred

⚡ Cost-vs-speed tradeoff among passing configs:
  Cheapest is 1.5× slower than fastest; fastest costs 2.1× more per call.

============================================================================================
At 10,000 calls/day with the cheapest passing config:
  openai:gpt-5-mini @ effort=low
  → $3.54/day  $1,291/year

  Switching from openai:o4-mini @ effort=high ($13.65/day)
  saves $10.11/day = $3,691/year (74% reduction)

You now have a defensible answer to "do we really need the bigger / more-thinking config?" — backed by a real sweep with Wilson 95% CIs on your prompts. Swap --models for Anthropic / Gemini / vLLM (self-hosted) and re-run; the workflow is identical.
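The savings figures in the output can be re-derived by hand from the $/k-pred column. The sketch below does that arithmetic with the rounded values printed above; the function name is made up for illustration and is not part of depth-lens.

```python
def daily_cost(usd_per_k_pred: float, daily_calls: int) -> float:
    """Convert a $/1000-predictions price into a daily bill."""
    return usd_per_k_pred * daily_calls / 1000


cheapest = daily_cost(0.354, 10_000)  # openai:gpt-5-mini @ effort=low
baseline = daily_cost(1.365, 10_000)  # openai:o4-mini @ effort=high

saved_per_day = baseline - cheapest
reduction = saved_per_day / baseline

print(f"${cheapest:.2f}/day vs ${baseline:.2f}/day")
print(f"saves ${saved_per_day:.2f}/day, "
      f"about ${saved_per_day * 365:,.0f}/year ({reduction:.0%} reduction)")
```

With the rounded inputs this lands within a dollar or two of the printed annual figures; the small gap is presumably because the tool works from unrounded per-call costs.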

Evidence: three real business tasks, three measurements

We ran depth-lens end-to-end on three production-style chatbot tasks that map to three different scoring needs. Same recommend workflow, three different scorers.

Case 1 — Tenant-inquiry urgency classifier (real-estate management)

Classify tenant messages into 緊急 / 通常 / 翌営業日 (urgent / business hours / next business day). 20 realistic prompts: water leaks, gas leaks, lockouts, contract questions, noise complaints.

Config | Accuracy | Latency p50
openai:o4-mini @ effort=medium ← chosen | 100% | 0.52 s
openai:gpt-5-mini @ effort=low | 95% | 0.67 s
openai:o4-mini @ effort=high (default "safe") | 100% | 0.74 s

Cost reduction: ~88% vs defaulting to o4-mini @ high or gpt-5.

Domain insight depth-lens surfaced: the 95% config's single miss was 通常 → 翌営業日 (business hours → next business day, the safe direction). There were no 緊急 → 通常 (urgent → business hours) errors, so accuracy alone undersells the cheaper config's safety profile.

Case 2 — System-monitoring quote estimator (MSP / IT ops)

Compute monthly quote estimates from free-form Japanese inquiries (plan tier × server count × options × volume discount). 53 prompts across 5 difficulty tiers, including typos, mixed formal/casual register, and implicit tier hints such as "ミッションクリティカル" ("mission-critical") → premium.

Config | Accuracy (all 5 tiers) | Latency p50
openai:gpt-5-mini @ effort=low ← chosen | 100% (53/53) | 0.41 – 0.50 s
openai:o4-mini @ effort=medium | 100% | 0.65 s
openai:o4-mini @ effort=high (default "safe") | 100% | 0.70 s

Cost reduction: ~88% vs the "complex calculation needs a more capable model" intuition.

Counter-intuitive finding: the cheapest config solved both the multi-step pricing math and the production-realistic messy input. gpt-5-mini @ low handles compound discount logic, mixed plans, and colloquial Japanese ("がっつり監視で", roughly "with full-on monitoring") at 100%.

Case 3 — Tenant-reply quality, judged by LLM (v2.1)

Same property-management chatbot, but generating free-form replies to tenant inquiries. Quality judged by a separate LLM against a 3-criterion rubric (polite using 敬語, addresses the specific issue, proposes a concrete next step). 12 prompts spanning urgent / procedural / rules / complaints / repairs.

Config | All 3 criteria met | Per-reply latency
openai:gpt-5-mini @ effort=low ← chosen | 100% (12/12) | 1.4 s
openai:o4-mini @ effort=high | 100% (12/12) | 1.8 s
openai:gpt-5-mini @ effort=high (default "safe") | 75% (9/12) | 15.6 s ← unusable

Counter-intuitive: gpt-5-mini accuracy decreases with higher effort (low 100% → high 75%); over-elaboration breaks the rubric's "addresses the specific issue" and "concrete next step" criteria. o4-mini shows the opposite curve. Optimal effort is per-(model, task), not universal.

Why this case matters: until v2.1, depth-lens couldn't measure free-form replies, only structured tasks like Cases 1 and 2. The new llm: scorer made this measurable.

Full case study →

Five patterns these three cases collectively show

  1. "Use the bigger model to be safe" is a strict loss when measured — same accuracy, more cost, no latency budget gained. Case 3 sharpens this: higher effort can decrease accuracy on free-form tasks where over-elaboration hurts the rubric.
  2. Stratified bench (simple → production-messy) reveals where each tier breaks — or, as in Case 2, that none of the candidates do.
  3. ~80-90% cost reduction is typical when teams stop pre-judging model selection and run a quick depth-lens sweep instead.
  4. Production-realistic input must be in the bench from day 1. Synthetic tier-1 prompts alone systematically over-recommend expensive models; Case 2's 30 messy "real-log-style" prompts are what its conclusion's confidence interval rests on.
  5. The right scorer matters more than the model choice. The three cases cover two of the three scorer families depth-lens ships: Cases 1 and 2 use structured scorers and Case 3 uses LLM-as-judge, with regex as the third family. New production tasks land in one of these buckets.

Can I measure my task? Three scorer families

Family | Spec form | When to use
Structured | exact, first_int, last_int, yes_no, contains | Classification, numeric answers, yes/no decisions. Cases 1 and 2 above.
Regex | regex:<pattern> | Format-checking, "answer must match this shape".
LLM-as-judge | llm:<judge-model>:<criterion> or llm:<judge-model>:rubric:<text> | Open-ended outputs: summaries, free-form Q&A, multi-criterion checks. Case 3 above.
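To make the structured and regex families concrete, here is a rough sketch of how scorers with these names could behave. These are not the bundled implementations: the function bodies, the integer pattern, and the comparison rules are assumptions based only on the scorer names in the table.

```python
import re

INT_PATTERN = re.compile(r"-?\d+")  # assumed notion of "an integer in the output"


def first_int(output: str):
    """Return the first integer found in the model output, or None."""
    match = INT_PATTERN.search(output)
    return int(match.group()) if match else None


def last_int(output: str):
    """Return the last integer found in the model output, or None."""
    found = INT_PATTERN.findall(output)
    return int(found[-1]) if found else None


def score_first_int(output: str, target) -> bool:
    """Pass when the first integer in the output equals the target."""
    return first_int(output) == int(target)


def score_regex(output: str, pattern: str) -> bool:
    """Pass when the output matches the given shape anywhere in the text."""
    return re.search(pattern, output) is not None


print(score_first_int("The answer is 42, not 7.", "42"))
print(score_regex("ID-2024", r"^ID-\d{4}$"))
```

The difference between first_int and last_int matters for models that show their working: a chain of intermediate numbers followed by a final answer is better served by last_int.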

Built-in criteria for llm:: correct / faithful / helpful / concise / format / polite. Free-form rubrics are arbitrary text after :rubric:.

# LLM-judge example: grade summary faithfulness with gpt-5-mini
depth-lens recommend \
    --models openai:gpt-5-mini,openai:o4-mini \
    --task "custom:./summaries.jsonl:llm:openai:gpt-5-mini:faithful" \
    --target-accuracy 0.85 --n-samples 32
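The --task string above packs several fields into one colon-separated spec. The sketch below shows one way such a string could be taken apart; it is an illustration of the spec forms listed in the table, not the parser depth-lens actually uses, and the names JudgeSpec, parse_custom_task and parse_llm_scorer are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeSpec:
    judge_model: str
    criterion: Optional[str] = None
    rubric: Optional[str] = None


def parse_custom_task(task: str):
    """Split 'custom:<jsonl>:<scorer>' into (path, scorer)."""
    kind, path, scorer = task.split(":", 2)  # scorer may itself contain colons
    if kind != "custom":
        raise ValueError(f"not a custom task spec: {task!r}")
    return path, scorer


def parse_llm_scorer(spec: str) -> JudgeSpec:
    """Split 'llm:<judge-model>:<criterion>' or 'llm:<judge-model>:rubric:<text>'."""
    if not spec.startswith("llm:"):
        raise ValueError(f"not an llm scorer spec: {spec!r}")
    rest = spec[len("llm:"):]
    if ":rubric:" in rest:
        judge, rubric = rest.split(":rubric:", 1)  # rubric text may contain colons
        return JudgeSpec(judge, rubric=rubric)
    judge, criterion = rest.rsplit(":", 1)  # judge is '<vendor>:<model>'
    return JudgeSpec(judge, criterion=criterion)


path, scorer = parse_custom_task(
    "custom:./summaries.jsonl:llm:openai:gpt-5-mini:faithful"
)
print(path, parse_llm_scorer(scorer))
```

One consequence of a colon-delimited format, under this reading, is that file paths containing a colon (for example Windows drive letters) would need separate handling.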

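Pick a different (and ideally cheaper) judge than the model under test to avoid self-judging bias. Note that the example above uses gpt-5-mini both as a candidate and as the judge, so swap the judge before relying on its numbers. As of 2026, gemini-3.1-flash-lite is the cheapest competent judge.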

Between these three families, almost every production AI task is measurable — classification, structured extraction, RAG-faithfulness, customer-support reply quality, code review, tone checks. If your task doesn't fit, file an issue.

What's in the box

6 adapter families

Spec | Compute knob | Cost basis
anthropic:<model> | thinking_budget_tokens | API ($/M-token)
openai:<model> | reasoning_effort | API ($/M-token)
gemini:<model> | thinking_budget_tokens (2.5) / auto-mapped to thinking_level (3.x) | API ($/M-token)
vllm:<model> | reasoning_effort (thinking models) or max_tokens (instruct-only, OpenAI-compatible local server) | self-hosted ($/GPU-hour)
hf:<hf-model-id> | max_thinking_tokens (CoT length) | local GPU ($/GPU-hour)
openmythos | n_loops (Recurrent-Depth Transformer) | local GPU ($/GPU-hour)
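The two cost bases in the last column can both be reduced to dollars per prediction, which is what lets API and self-hosted points share an axis. The sketch below shows the arithmetic under simple assumptions: the pricing keys ({input, output} and {gpu_hourly, gpus}) are the ones this README lists for .cost_per_cell, while the function names, token counts, and GPU rate are hypothetical, and the GPU formula assumes one request at a time with no batching.

```python
def token_cost_per_pred(input_tokens: int, output_tokens: int, pricing: dict) -> float:
    """API cost of one prediction; pricing is USD per 1M tokens."""
    return (
        input_tokens * pricing["input"] + output_tokens * pricing["output"]
    ) / 1_000_000


def gpu_cost_per_pred(seconds_per_pred: float, pricing: dict) -> float:
    """Self-hosted cost of one prediction; pricing is USD per GPU-hour.

    Assumes the GPUs serve a single request at a time. With batching,
    divide by the number of predictions served concurrently.
    """
    return pricing["gpu_hourly"] * pricing["gpus"] * seconds_per_pred / 3600


# Hypothetical workload: 200 prompt tokens, 50 completion tokens.
api = token_cost_per_pred(200, 50, {"input": 1.0, "output": 5.0})
# Hypothetical box: one GPU at $0.60/hour, 0.3 s per prediction.
local = gpu_cost_per_pred(0.3, {"gpu_hourly": 0.60, "gpus": 1})

print(f"API:   ${api * 1000:.3f}/k-pred")
print(f"local: ${local * 1000:.3f}/k-pred")
```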

API adapters fan requests through a thread pool (max_concurrent); a 1000-prompt probe finishes in minutes, not hours.
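For readers unfamiliar with the pattern, this is roughly what fanning requests through a thread pool looks like with the standard library. It is a generic sketch, not the adapters' actual code: fake_api_call stands in for a real network request, and only the max_concurrent name is taken from the sentence above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_api_call(prompt: str) -> str:
    """Stand-in for a network-bound LLM request (hypothetical)."""
    time.sleep(0.01)  # simulate latency; threads overlap this waiting time
    return prompt.upper()


def run_all(prompts, max_concurrent: int = 8):
    """Send prompts through a bounded pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(fake_api_call, prompts))


outputs = run_all([f"prompt {i}" for i in range(32)])
print(len(outputs), outputs[:2])
```

Threads suit this workload because each call spends nearly all its time waiting on the network, so the pool size mostly trades wall-clock time against vendor rate limits.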

6 built-in probe tasks + custom

Task | Depth axis | Reasoning shape
k-hop | K (operators) | Forward composition (mod-arithmetic)
parity | n (bits) | Aggregation (XOR reduction)
graph-reach | path length | Single BFS pass
state-tracking | K (instructions) | 2-counter register machine
mini-csp | n (variables) | Search / constraint propagation (2-SAT)
dict-lookup | n (pairs) | Field extraction from structured input (v2.0)
custom:<jsonl>:<scorer> | optional depth field | Bring your own data
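To illustrate what a depth axis means in practice, here is a toy generator in the spirit of the k-hop row: K chained modular-arithmetic operators, with K as the depth. The wording, operator set, and modulus are invented for this sketch and will differ from the bundled task.

```python
import random


def make_k_hop(k: int, modulus: int = 7, seed: int = 0) -> dict:
    """Build one prompt that composes k modular-arithmetic steps."""
    rng = random.Random(seed)  # seeded so the bench is reproducible
    value = rng.randrange(modulus)
    steps = [f"Start with {value}."]
    for _ in range(k):
        op = rng.choice(["add", "multiply by"])
        operand = rng.randrange(1, modulus)
        if op == "add":
            value = (value + operand) % modulus
        else:
            value = (value * operand) % modulus
        steps.append(f"Then {op} {operand}, modulo {modulus}.")
    prompt = " ".join(steps) + " What is the result? Answer with a single integer."
    # Same row shape as the bring-your-own JSONL format.
    return {"prompt": prompt, "target": str(value), "depth": k}


row = make_k_hop(4)
print(row["prompt"])
print("target:", row["target"])
```

Rows in this shape could be written to a JSONL file and fed through the custom: task, which is one way to build a stratified bench before real prompts are available.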

Diagnostics every ProbeResult exposes

  • .accuracy[depth][compute] grid in [0, 1]
  • .ci() — Wilson 95% intervals on every cell
  • .effective_depth(threshold=0.5) — biggest depth where some compute level clears the bar
  • .overthinking(depth, tolerance=0.02) reports whether accuracy peaks below the maximum compute level, and by how much
  • .cost_per_cell(pricing) — $/prediction. Token-based ({input, output} USD-per-1M) or GPU-hour ({gpu_hourly, gpus}) — pick whichever fits the adapter
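For reference, the Wilson score interval behind .ci() can be computed in a few lines. This is the textbook formula with z = 1.96 for a 95% interval, written as a standalone sketch; it is not copied from depth-lens, and the function name is made up here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Counts taken from the case studies above.
for k, n in [(53, 53), (12, 12), (9, 12)]:
    lo, hi = wilson_interval(k, n)
    print(f"{k}/{n}: [{lo:.3f}, {hi:.3f}]")
```

A perfect score on a small bench still leaves a wide interval: the lower bound for 12/12 is roughly 0.76, while 53/53 tightens it to roughly 0.93. That is the practical argument for raising --n-samples or adding prompts before acting on a close call.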

CLI

depth-lens recommend ... # find cheapest model meeting your accuracy bar (production workflow)
depth-lens probe ...     # detailed sweep of one model
depth-lens compare ...   # overlay several models on the same task
depth-lens dashboard     # Streamlit UI over your cached probes

Each subcommand has full --help. See docs/playbook/ for end-to-end production scenarios: model-downgrade · cost-audit · regression-detection · self-hosting-with-vllm.

Python API

from depth_lens import probe
from depth_lens.tasks import get_task
from depth_lens.adapters.anthropic_adapter import AnthropicAdapter

# Sweep the 2-SAT task at four depths, 16 samples per (depth, compute) cell.
task = get_task("mini-csp")
adapter = AnthropicAdapter(model="claude-haiku-4-5", task_name="mini-csp")
result = probe(adapter, task, depths=[3, 5, 7, 9], n_samples=16)

# Token pricing in USD per 1M tokens.
pricing = {"input": 1.0, "output": 5.0}

print(f"effective depth: {result.effective_depth(0.5)}")
print(f"overthinking @ d=9: {result.overthinking(9)}")
# Index [3, 1] selects depth index 3 (d=9) and compute index 1 (mid budget).
print(f"$/pred @ d=9 mid budget: {result.cost_per_cell(pricing)[3, 1]}")
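The two diagnostics printed above are easy to reason about on a plain accuracy grid. The sketch below re-implements them from their one-line descriptions in the diagnostics list; the real methods may differ in details such as tie-breaking or whether the threshold is inclusive, and the grid values are made up.

```python
def effective_depth(grid: dict, threshold: float = 0.5):
    """Largest depth at which at least one compute level reaches the threshold."""
    passing = [d for d, accs in grid.items() if max(accs) >= threshold]
    return max(passing) if passing else None


def overthinking(grid: dict, depth: int, tolerance: float = 0.02) -> float:
    """How far accuracy at max compute falls below the best level, if beyond tolerance."""
    accs = grid[depth]  # ordered from lowest to highest compute
    gap = max(accs) - accs[-1]
    return gap if gap > tolerance else 0.0


# Hypothetical grid: depth -> accuracy at (low, mid, high) compute.
grid = {
    3: [0.90, 1.00, 1.00],
    5: [0.60, 0.80, 0.75],
    7: [0.30, 0.55, 0.50],
    9: [0.10, 0.20, 0.20],
}

print("effective depth:", effective_depth(grid, 0.5))
print("overthinking @ d=5:", round(overthinking(grid, 5), 3))
```

In this toy grid the mid budget beats the high budget at depth 5, which is the pattern the overthinking diagnostic is meant to flag.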

What depth-lens is NOT

  • We don't run MMLU, GSM8K, or similar leaderboards. Those crown frontier models on canonical benchmarks; production teams already picked a model family and need to tune within it.
  • We don't test "is the model smart." We test "which configuration of this family meets your accuracy bar at the lowest cost / latency / GPU-time."
  • We don't ship a managed dashboard. The OSS produces JSONs and plots locally; building hosted dashboards on top of those is outside scope.

Capability LLMThinkBench usail-hkust bench o1 scaling laws depth-lens
Compute-axis curves (not single point) partial ✅ (o1 only)
Cross-vendor (Claude / o-series / Gemini / OSS) ❌ HF only partial ❌ o1 only
Self-hosted vLLM on same axis as APIs
Looped transformer (OpenMythos)
Bring-your-own JSONL
LLM-as-judge scorer for open-ended tasks
Cost per prediction with sweep

The closest active competitor is LLMThinkBench, which targets math-task overthinking on HuggingFace models at a fixed operating point. That is orthogonal to depth-lens's compute-axis sweep across vendor APIs.

Use cases depth-lens is built for

You are asking… | What depth-lens recommend outputs | Headline evidence
1. Which API tier / thinking budget should I be paying for? | Cheapest passing (model, knob) across your prompts | Opus 4.7 → Haiku 4.5 saves ~$123k/year at 10k calls/day, same accuracy (finding)
2. Should I self-host an open model instead of paying the API? | API and vLLM points on one Pareto ($/M-token vs $/GPU-hour, same axis) | gemini-3.1-flash-lite beats every 4080 SUPER self-hosted candidate at K-hop tier 4; Llama-3-8B AWQ is cheapest at tier 1 ($0.028/1k calls) (finding)
3. Can I measure free-form output quality? | LLM-judge scores with the same Wilson CIs as structured scorers | Case 3 above: gpt-5-mini @ low wins on a 3-criterion rubric; higher effort decreases quality (finding)
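The "one Pareto" idea in the table comes down to discarding every config that another config beats on all axes. The sketch below computes such a frontier over cost and latency, using the six passing configs from the quickstart output as data; the function is a generic illustration, not the tool's plotting code.

```python
def pareto_frontier(points):
    """Keep points not dominated on (cost, latency); lower is better on both.

    points: list of (name, cost, latency) tuples.
    """
    frontier = []
    for name, cost, latency in points:
        dominated = any(
            c <= cost and l <= latency and (c < cost or l < latency)
            for _, c, l in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# ($/k-pred, s/pred) from the quickstart output above.
configs = [
    ("openai:gpt-5-mini @ low", 0.354, 0.45),
    ("openai:gpt-5-mini @ medium", 0.485, 0.59),
    ("openai:o4-mini @ low", 0.736, 0.29),
    ("openai:gpt-5-mini @ high", 0.886, 0.69),
    ("openai:o4-mini @ medium", 1.061, 0.37),
    ("openai:o4-mini @ high", 1.365, 0.40),
]

frontier = pareto_frontier(configs)
print(frontier)
```

Only the cheapest and the fastest config survive here, matching the two rows the quickstart output flags. Self-hosted points can join the same comparison once their GPU-hour cost is expressed per prediction.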

For research-oriented use (paradigm scaling, inference-time-compute measurement infrastructure), see the v2.0 cross-paradigm measurement plot — a tool for putting Token-CoT API · Self-hosted vLLM · Looped transformers on a single FLOPs axis. We emphasize this is the measurement tool contribution; the underlying observation (specialized model beats generalist on the specific task it was trained for) is deep-learning textbook material.

All findings the tool has produced

We ran depth-lens on every vendor we could get an API key for, on all bundled tasks — current generation and one generation back to keep the cross-vendor comparison fair. Total spend: ~$14 API + ~30 min local GPU + ~$1 LLM-judge (case study 3).

Use case | Finding | Why it matters
API ops | Opus 4.7 → Haiku 4.5 saves ~$123k/year on a 10k-call/day task | Four concrete tier-downgrade switches, with savings in dollars
API ops | gpt-5-mini is cheaper per token but 3× slower than o4-mini | Optimizing $/token alone burns UI latency; the Pareto frontier on K-hop tier 4 has 2 points
API ops | Haiku 4.5 collapses on hard 2-SAT at default budget | Constraint-style problems need budget ≥ 4096, or the error rate doubles
API ops | Gemini 2.5 Flash is uniquely weak vs same-era Anthropic / OpenAI cheap reasoning | 3.1 Flash-Lite closes the gap
API ops | Claude Opus 4.7 cost varies 10× across (depth × budget) at fixed accuracy | Maxing the budget is a strict cost loss
API ops | Per-vendor cost-vs-latency plots | One scatter per vendor: Pareto frontier vs budget knobs
Build vs buy | Self-hosted vLLM vs hosted APIs on one Pareto | Llama-3-8B AWQ is cheapest at tier 1 but scores 0% at tier 4. DeepSeek-R1-Distill-1.5B hits 0.75 at tier 4. Build-vs-buy as a chart, not a guess
Open-ended | Customer-reply quality via LLM-as-judge (Case 3) | gpt-5-mini accuracy decreases with effort on free-form tasks; optimal effort is per-(model, task)
Research / tool | v2.0: 3 inference-time-compute paradigms on one FLOPs axis | Infrastructure to compare Token-CoT API · Self-hosted vLLM · Looped (OpenMythos 1M/10M/100M) on the same axis. The headline 24,000-410,000× FLOPs ratio is a deep-learning-textbook result; the tool is the contribution
Research | OpenMythos vs Claude head-to-head | Within the training distribution, a 925K-param looped model is ~10,000× faster than Claude at the same accuracy. Outside it, the API dominates
Research | OpenMythos loops-vs-accuracy saturation | "More loops = deeper reasoning" saturates at training_max_loop_iters
Research | OpenMythos extrapolates 1-2 hops past training depth on K-hop | Seed experiment that motivated the project

→ Full v1.0 cross-vendor summary

Status

  • v0.1 MVP — first end-to-end probe (May 2026)
  • v0.5 — 4 tasks, 5 adapters, Wilson CIs, cache, Streamlit dashboard
  • v1.0 — 6 adapter families, 5 tasks, full cross-vendor benchmark (Anthropic/OpenAI/Gemini, current + 2025 prior gen), multi-stage Docker, GitHub Actions CI
  • v1.1 — OpenMythos head-to-head; cross-paradigm Pareto
  • v1.2 — self-hosted vLLM with GPU-hour pricing on the same Pareto
  • v2.0 — 3-paradigm FLOPs measurement tool, dict-lookup task, depth_lens.flops module
  • v2.1 — LLM-as-judge scorer for open-ended tasks (llm:<judge>:<criterion>), tenant-reply case study
  • v2.2 — PyPI publish, judge cost folded into recommend $/k-pred, --free-form CLI flag, code-generation task

128 unit tests passing. See ROADMAP.md for what's next.

Install variants

# API-only (no GPU needed): Anthropic, OpenAI, Gemini, dashboard
pip install -e ".[anthropic,openai,gemini,dashboard]"

# +looped transformer + HuggingFace local probes
pip install -e ".[openmythos,huggingface,anthropic,openai,gemini,dashboard]"

# +self-hosted vLLM (vLLM runs separately via docker compose)
pip install -e ".[anthropic,openai,gemini,dashboard]"   # OpenAI SDK is all that's needed client-side

# Just the framework (BYO adapters)
pip install -e .

Python 3.11+. The bundled OpenMythos training helper assumes CUDA; everything else is happy on CPU or against remote APIs.

Contributing

See CONTRIBUTING.md for how to add a Task or an Adapter (both are ~50 lines + a test) and the conventions used in the bundled implementations.

Citation

@software{depth_lens_2026,
  title  = {depth-lens: Measuring Inference-Time Compute for LLM Production Decisions},
  author = {yutoTachibana},
  year   = {2026},
  url    = {https://github.com/yutoTachibana/depth-lens}
}

License

MIT.
