Cost-vs-accuracy CI for LLM ops. Pick the cheapest API tier, compare self-hosted vLLM vs cloud APIs on one Pareto, and grade open-ended outputs with an LLM-as-judge scorer — all on your own data with Wilson 95% CIs.


depth-lens

Pick the cheapest LLM config that meets your accuracy bar — for your data, not somebody else's benchmark.

Sweep every (model, knob) on your prompts in one CLI call. Wilson 95% CIs, per-call cost, latency p50, cross-vendor. Roughly $1 and ten minutes per audit.

Japanese version

Switching from Opus 4.7 to Haiku 4.5 saves $123k/year on a 10k-calls/day workload — same accuracy

The plot above is a real depth-lens recommend output. Four Anthropic configurations all score 1.00 accuracy on a K-hop tier-4 prompt set — and span ~35× in cost. That's the gap the "use the latest / biggest" instinct burns through silently. depth-lens finds the cheapest passing tier on your prompts in under 10 minutes. The 30-second install is right below.

30 seconds: install, run, decide

pip install "depth-lens[openai]"            # add ,anthropic / ,gemini as needed; quotes keep zsh from globbing the brackets
export OPENAI_API_KEY=...

# Use the bundled example bench (5 modular-arithmetic prompts), or write your own:
python -c "from depth_lens.data import copy_example; copy_example('modular_arithmetic.jsonl')"

depth-lens recommend \
    --models openai:gpt-5-mini,openai:o4-mini \
    --task custom:modular_arithmetic.jsonl:first_int \
    --target-accuracy 0.95 \
    --max-latency 3.0 \
    --n-samples 16 \
    --daily-calls 10000

To bring your own prompts, swap modular_arithmetic.jsonl for any JSONL of {"prompt": ..., "target": ..., "depth": ...} rows.
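As a minimal sketch of what such a file could look like, the snippet below writes and re-reads a two-row bench. Only the three field names (prompt, target, depth) come from the text above; the file name my_bench.jsonl, the helper names, the example prompts, and the choice to store targets as strings are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Hypothetical rows: only the field names are taken from the README.
rows = [
    {"prompt": "What is (7 + 5) mod 4? Answer with a single integer.",
     "target": "0", "depth": 1},
    {"prompt": "What is ((3 * 4) + 9) mod 5? Answer with a single integer.",
     "target": "1", "depth": 2},
]


def write_bench(path, rows):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII prompts (e.g. Japanese) readable
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_bench(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


write_bench("my_bench.jsonl", rows)
print(len(read_bench("my_bench.jsonl")), "rows written")
```

The resulting file would then be passed as --task custom:my_bench.jsonl:first_int, following the command shown above.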

============================================================================================
Target accuracy ≥ 0.95  ·  Max latency ≤ 3.00s/pred
Probed 6 configurations, 6 passing.
============================================================================================

✅ Passing (cheapest first):
  openai:gpt-5-mini     d=1  effort=low      acc=1.00   $0.354/k-pred   0.45s/pred  ← cheapest
  openai:gpt-5-mini     d=1  effort=medium   acc=1.00   $0.485/k-pred   0.59s/pred
  openai:o4-mini        d=1  effort=low      acc=1.00   $0.736/k-pred   0.29s/pred  ← fastest
  openai:gpt-5-mini     d=1  effort=high     acc=1.00   $0.886/k-pred   0.69s/pred
  openai:o4-mini        d=1  effort=medium   acc=1.00   $1.061/k-pred   0.37s/pred
  openai:o4-mini        d=1  effort=high     acc=1.00   $1.365/k-pred   0.40s/pred

⚡ Cost-vs-speed tradeoff among passing configs:
  Cheapest is 1.5× slower than fastest; fastest costs 2.1× more per call.

============================================================================================
At 10,000 calls/day with the cheapest passing config:
  openai:gpt-5-mini @ effort=low
  → $3.54/day  $1,291/year

  Switching from openai:o4-mini @ effort=high ($13.65/day)
  saves $10.11/day = $3,691/year (74% reduction)

You now have a defensible answer to "do we really need the bigger / more-thinking config?" — backed by a real sweep with Wilson 95% CIs on your prompts. Swap --models for Anthropic / Gemini / vLLM (self-hosted) and re-run; the workflow is identical.
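For readers who want to see what a Wilson 95% interval does with a small sample, here is a sketch of the standard Wilson score formula. The function name is ours, and whether depth-lens uses exactly this variant (for example, with or without a continuity correction) is not stated in this README.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(16, 16)
print(f"16/16 correct -> 95% CI [{lo:.3f}, {hi:.3f}]")  # roughly [0.806, 1.000]
```

With 16 samples, a perfect score still leaves a lower bound near 0.81, so it may be worth raising --n-samples when the accuracy bar is tight.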

Evidence: three real business tasks, three measurements

We ran depth-lens end-to-end on three production-style chatbot tasks that map to three different scoring needs. Same recommend workflow, three different scorers.

Case 1 — Tenant-inquiry urgency classifier (real-estate management)

Classify tenant messages into 緊急 / 通常 / 翌営業日 (urgent / business hours / next business day). 20 realistic prompts: water leaks, gas leaks, lockouts, contract questions, noise complaints.

Config	Accuracy	Latency p50
`openai:o4-mini @ effort=medium` ← chosen	100%	0.52 s
`openai:gpt-5-mini @ effort=low`	95%	0.67 s
`openai:o4-mini @ effort=high` (default "safe")	100%	0.74 s

Cost reduction: ~88% vs defaulting to o4-mini @ high or gpt-5. Domain insight depth-lens surfaced: the 95% config's single miss was 通常 → 翌営業日 (safe direction). No 緊急 → 通常 errors — accuracy alone undersells the cheaper config's safety profile.

Case 2 — System-monitoring quote estimator (MSP / IT ops)

Compute monthly quote estimates from free-form Japanese inquiries (plan tier × server count × options × volume discount). 53 prompts across 5 difficulty tiers, including typos, mixed formal/casual register, and implicit tier hints like "ミッションクリティカル" ("mission-critical") → premium.

Config	All 5 tiers acc	Latency p50
`openai:gpt-5-mini @ effort=low` ← chosen	100% (53/53)	0.41 – 0.50 s
`openai:o4-mini @ effort=medium`	100%	0.65 s
`openai:o4-mini @ effort=high` (default "safe")	100%	0.70 s

Cost reduction: ~88% vs the "complex calculation needs a more capable model" intuition. Counter-intuitive finding: multi-step pricing math and production-realistic messy input are both solved by the cheapest config. gpt-5-mini @ low handles compound discount logic, mixed plans, AND colloquial Japanese ("がっつり監視で", roughly "with heavy-duty monitoring") at 100%.

Case 3 — Tenant-reply quality, judged by LLM (v2.1)

Same property-management chatbot, but generating free-form replies to tenant inquiries. Quality is judged by a separate LLM against a 3-criterion rubric (polite, using 敬語 (keigo, honorific speech); addresses the specific issue; proposes a concrete next step). 12 prompts spanning urgent / procedural / rules / complaints / repairs.

Config	All 3 criteria met	Per-reply latency
`openai:gpt-5-mini @ effort=low` ← chosen	100% (12/12)	1.4 s
`openai:o4-mini @ effort=high`	100% (12/12)	1.8 s
`openai:gpt-5-mini @ effort=high` (default "safe")	75% (9/12)	15.6 s ← unusable

Counter-intuitive: gpt-5-mini accuracy decreases with higher effort (low 100% → high 75%), because over-elaboration breaks the rubric's "addresses the specific issue / concrete next step" criteria. o4-mini shows the opposite curve. Optimal effort is per-(model, task), not universal.

Why this case matters: until v2.1, depth-lens couldn't measure free-form replies, only structured tasks like Cases 1 and 2. The new llm: scorer made this measurable.

Full case study →

Five patterns these three cases collectively show

"Use the bigger model to be safe" is a strict loss when measured — same accuracy, more cost, no latency budget gained. Case 3 sharpens this: higher effort can decrease accuracy on free-form tasks where over-elaboration hurts the rubric.
Stratified bench (simple → production-messy) reveals where each tier breaks — or, as in Case 2, that none of the candidates do.
~80-90% cost reduction is typical when teams stop pre-judging model selection and run a quick depth-lens sweep instead.
Production-realistic input must be in the bench from day 1. Synthetic tier-1 prompts alone systematically over-recommend expensive models; in Case 2, the 30 messy "real-log-style" prompts are what the conclusion's confidence interval rests on.
The right scorer matters more than the model choice. The three cases cover the three scorer families depth-lens ships (structured / regex / LLM-as-judge). New production tasks land in one of these buckets.

Can I measure my task? Three scorer families

Family	Spec form	When to use
Structured	`exact`, `first_int`, `last_int`, `yes_no`, `contains`	Classification, numeric answers, yes/no decisions. Cases 1 and 2 above.
Regex	`regex:<pattern>`	Format-checking, "answer must match this shape".
LLM-as-judge	`llm:<judge-model>:<criterion>` or `llm:<judge-model>:rubric:<text>`	Open-ended outputs: summaries, free-form Q&A, multi-criterion checks. Case 3 above.

Built-in criteria for llm:: correct / faithful / helpful / concise / format / polite. Free-form rubrics are arbitrary text after :rubric:.

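# LLM-judge example: grade summary faithfulness with gpt-5-mini as the judge.
# Here the judge is also one of the models under test; swap in a different judge to avoid self-judging bias.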
depth-lens recommend \
    --models openai:gpt-5-mini,openai:o4-mini \
    --task "custom:./summaries.jsonl:llm:openai:gpt-5-mini:faithful" \
    --target-accuracy 0.85 --n-samples 32

Pick a different (and ideally cheaper) judge than the model under test to avoid self-judging bias. As of 2026, gemini-3.1-flash-lite is the cheapest competent judge.

Between these three families, almost every production AI task is measurable — classification, structured extraction, RAG-faithfulness, customer-support reply quality, code review, tone checks. If your task doesn't fit, file an issue.
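To make the structured and regex families concrete, here is a rough sketch of how scorers with these names might behave. The scorer names come from the table above, but the function names and matching rules below are our assumptions, not depth-lens's actual implementation, which may normalise text or handle edge cases differently.

```python
import re

INT_PATTERN = re.compile(r"-?\d+")


def score_exact(output, target):
    """Pass when the stripped output equals the stripped target."""
    return output.strip() == str(target).strip()


def score_first_int(output, target):
    """Pass when the first integer in the output equals the target."""
    match = INT_PATTERN.search(output)
    return match is not None and int(match.group()) == int(target)


def score_last_int(output, target):
    """Pass when the last integer in the output equals the target."""
    found = INT_PATTERN.findall(output)
    return bool(found) and int(found[-1]) == int(target)


def score_regex(output, pattern):
    """Pass when the output contains a match for the pattern."""
    return re.search(pattern, output) is not None


reply = "Step 1: 3 + 4 = 7. Final answer: 7"
print(score_first_int(reply, "7"))   # False: the first integer is the step number 1
print(score_last_int(reply, "7"))    # True: the last integer is the final answer
print(score_regex(reply, r"Final answer: \d+"))  # True
```

The example reply shows why the choice between first_int and last_int matters when a model prints its working before the answer.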

What's in the box

6 adapter families

Spec	Compute knob	Cost basis
`anthropic:<model>`	`thinking_budget_tokens`	API ($/M-token)
`openai:<model>`	`reasoning_effort`	API ($/M-token)
`gemini:<model>`	`thinking_budget_tokens` (2.5) / auto-mapped to `thinking_level` (3.x)	API ($/M-token)
`vllm:<model>`	`reasoning_effort` (thinking models) or `max_tokens` (instruct-only, OpenAI-compatible local server)	self-hosted ($/GPU-hour)
`hf:<hf-model-id>`	`max_thinking_tokens` (CoT length)	local GPU ($/GPU-hour)
`openmythos`	`n_loops` (Recurrent-Depth Transformer)	local GPU ($/GPU-hour)

API adapters fan requests out through a thread pool (max_concurrent), so a 1000-prompt probe finishes in minutes, not hours.
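The thread-pool fan-out can be sketched with the standard library alone. Only the name max_concurrent comes from the README; fan_out and fake_call are hypothetical stand-ins, and the real adapters presumably add retries, rate limiting, and token accounting on top.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fan_out(call_model, prompts, max_concurrent=8):
    """Run call_model over prompts with at most max_concurrent in flight.

    pool.map returns results in input order, so outputs line up with prompts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(call_model, prompts))


def fake_call(prompt):
    """Hypothetical stand-in for a blocking API request."""
    time.sleep(0.01)  # simulate network latency
    return prompt.upper()


prompts = [f"prompt {i}" for i in range(20)]
outputs = fan_out(fake_call, prompts, max_concurrent=5)
print(outputs[:2])
```

Threads suit this workload because each call spends nearly all of its time waiting on the network rather than using the CPU.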

6 built-in probe tasks + custom

Task	Depth axis	Reasoning shape
`k-hop`	K (operators)	Forward composition (mod-arithmetic)
`parity`	n (bits)	Aggregation (XOR reduction)
`graph-reach`	path length	Single BFS pass
`state-tracking`	K (instructions)	2-counter register machine
`mini-csp`	n (variables)	Search / constraint propagation (2-SAT)
`dict-lookup`	n (pairs)	Field extraction from structured input (v2.0)
`custom:<jsonl>:<scorer>`	optional `depth` field	Bring your own data

Diagnostics every `ProbeResult` exposes

.accuracy — [depth][compute] grid in [0, 1]
.ci() — Wilson 95% intervals on every cell
.effective_depth(threshold=0.5) — largest depth at which at least one compute level clears the threshold
.overthinking(depth, tolerance=0.02) — whether accuracy peaks below the maximum compute level, and by how much
.cost_per_cell(pricing) — $/prediction. Token-based ({input, output} USD-per-1M) or GPU-hour ({gpu_hourly, gpus}) — pick whichever fits the adapter
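The two pricing shapes can both be reduced to dollars per prediction. The sketch below shows one plausible way to do that; the pricing keys (input, output, gpu_hourly, gpus) come from the list above, while the function names, the token counts, the GPU price, and the assumption of one request at a time on the GPU (no batching) are ours.

```python
def token_cost_per_pred(input_tokens, output_tokens, pricing):
    """USD per prediction from per-1M-token prices."""
    return (input_tokens * pricing["input"]
            + output_tokens * pricing["output"]) / 1_000_000


def gpu_cost_per_pred(seconds_per_pred, pricing):
    """USD per prediction from GPU-hour prices, assuming no batching."""
    return pricing["gpu_hourly"] * pricing["gpus"] * seconds_per_pred / 3600


def project(cost_per_k_pred, daily_calls):
    """Scale a $/k-pred figure to (daily, yearly) spend."""
    daily = cost_per_k_pred * daily_calls / 1000
    return daily, daily * 365


# Hypothetical workload: 200 input and 50 output tokens per call.
api = token_cost_per_pred(200, 50, {"input": 1.0, "output": 5.0})
# Hypothetical GPU: $0.50/hour, one card, 0.36 s per prediction.
gpu = gpu_cost_per_pred(0.36, {"gpu_hourly": 0.5, "gpus": 1})
print(f"API ${api * 1000:.3f}/k-pred, GPU ${gpu * 1000:.3f}/k-pred")

# The $0.354/k-pred figure is taken from the quick-start output above.
daily, yearly = project(0.354, 10_000)
print(f"${daily:.2f}/day, ${yearly:,.0f}/year")
```

Projecting from the rounded $0.354/k-pred gives about $1,292/year, one dollar off the $1,291 in the quick-start output; our guess is that the tool works from unrounded per-call costs.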

CLI

depth-lens recommend ... # find cheapest model meeting your accuracy bar (production workflow)
depth-lens probe ...     # detailed sweep of one model
depth-lens compare ...   # overlay several models on the same task
depth-lens dashboard     # Streamlit UI over your cached probes

Each subcommand has full --help. See docs/playbook/ for end-to-end production scenarios: model-downgrade · cost-audit · regression-detection · self-hosting-with-vllm.

Python API

from depth_lens import probe
from depth_lens.tasks import get_task
from depth_lens.adapters.anthropic_adapter import AnthropicAdapter

task = get_task("mini-csp")
adapter = AnthropicAdapter(model="claude-haiku-4-5", task_name="mini-csp")
result = probe(adapter, task, depths=[3, 5, 7, 9], n_samples=16)

print(f"effective depth: {result.effective_depth(0.5)}")
print(f"overthinking @ d=9: {result.overthinking(9)}")
print(f"$/pred @ d=9 mid budget: {result.cost_per_cell({'input': 1.0, 'output': 5.0})[3, 1]}")
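To show what the two diagnostics printed above measure, here is a sketch over a plain accuracy grid. It follows the one-line descriptions in the diagnostics list, but the grid values are made up, the data layout (a dict from depth to accuracies ordered by increasing compute) is our choice, and details such as using >= for the threshold are assumptions rather than the library's confirmed behaviour.

```python
# Hypothetical grid: depth -> accuracy at [low, mid, high] compute.
grid = {
    3: [1.00, 1.00, 1.00],
    5: [0.75, 0.94, 0.88],
    7: [0.31, 0.56, 0.50],
    9: [0.06, 0.19, 0.25],
}


def effective_depth(accuracy, threshold=0.5):
    """Largest depth where at least one compute level reaches the threshold."""
    passing = [d for d, row in accuracy.items() if max(row) >= threshold]
    return max(passing) if passing else None


def overthinking(accuracy, depth, tolerance=0.02):
    """Accuracy lost by running at maximum compute instead of the best level.

    Gaps within the tolerance are treated as noise and reported as 0.0.
    """
    row = accuracy[depth]
    gap = max(row) - row[-1]
    return gap if gap > tolerance else 0.0


print(effective_depth(grid))            # 7: depth 9 never reaches 0.5
print(round(overthinking(grid, 5), 2))  # 0.06: mid compute beats high compute
print(overthinking(grid, 9))            # 0.0: accuracy peaks at max compute
```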

What depth-lens is NOT

We don't run MMLU, GSM8K, or similar leaderboards. Those crown frontier models on canonical benchmarks; production teams already picked a model family and need to tune within it.
We don't test "is the model smart." We test "which configuration of this family meets your accuracy bar at the lowest cost / latency / GPU-time."
We don't ship a managed dashboard. The OSS produces JSONs and plots locally; building hosted dashboards on top of those is outside scope.

Capability	LLMThinkBench	usail-hkust bench	o1 scaling laws	depth-lens
Compute-axis curves (not single point)	❌	partial	✅ (o1 only)	✅
Cross-vendor (Claude / o-series / Gemini / OSS)	❌ HF only	partial	❌ o1 only	✅
Self-hosted vLLM on same axis as APIs	❌	❌	❌	✅
Looped transformer (OpenMythos)	❌	❌	❌	✅
Bring-your-own JSONL	❌	❌	❌	✅
LLM-as-judge scorer for open-ended tasks	❌	❌	❌	✅
Cost per prediction with sweep	❌	❌	❌	✅

The closest active competitor is LLMThinkBench, which targets math-task overthinking on HuggingFace models at a fixed operating point. That is orthogonal to depth-lens's compute-axis sweep across vendor APIs.

Use cases depth-lens is built for

You are asking…	What `depth-lens recommend` outputs	Headline evidence
1. Which API tier / thinking budget should I be paying for?	Cheapest passing (model, knob) across your prompts	Opus 4.7 → Haiku 4.5 saves ~$123k/year at 10k calls/day, same accuracy (finding)
2. Should I self-host an open model instead of paying the API?	API and vLLM points on one Pareto ($/M-token vs $/GPU-hour, same axis)	`gemini-3.1-flash-lite` beats every 4080 SUPER self-hosted candidate at K-hop tier 4; Llama-3-8B AWQ is cheapest at tier 1 ($0.028/1k calls) (finding)
3. Can I measure free-form output quality?	LLM-judge scores with the same Wilson CIs as structured scorers	Case 3 above — `gpt-5-mini @ low` wins on a 3-criterion rubric; higher effort decreases quality (finding)
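The Pareto idea behind use case 2 can be illustrated with the six passing configurations from the quick-start output earlier in this README. All six score acc=1.00 there, so the sketch compares cost and latency only; the function name and tuple layout are ours, and depth-lens's own frontier logic may differ.

```python
# (model, effort, $/k-pred, s/pred), copied from the quick-start output.
configs = [
    ("openai:gpt-5-mini", "low", 0.354, 0.45),
    ("openai:gpt-5-mini", "medium", 0.485, 0.59),
    ("openai:o4-mini", "low", 0.736, 0.29),
    ("openai:gpt-5-mini", "high", 0.886, 0.69),
    ("openai:o4-mini", "medium", 1.061, 0.37),
    ("openai:o4-mini", "high", 1.365, 0.40),
]


def pareto_frontier(points):
    """Keep points that no other point beats on both cost and latency."""
    frontier = []
    best_latency = float("inf")
    # Walk from cheapest to priciest; a point survives only if it is
    # faster than everything cheaper than it.
    for point in sorted(points, key=lambda p: (p[2], p[3])):
        if point[3] < best_latency:
            frontier.append(point)
            best_latency = point[3]
    return frontier


for model, effort, cost, latency in pareto_frontier(configs):
    print(f"{model} @ effort={effort}: ${cost}/k-pred, {latency}s/pred")
```

Only two configurations survive: the cheapest (gpt-5-mini @ low) and the fastest (o4-mini @ low). Every other point costs more and is slower than one of them.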

For research-oriented use (paradigm scaling, inference-time-compute measurement infrastructure), see the v2.0 cross-paradigm measurement plot, a tool for putting Token-CoT API · Self-hosted vLLM · Looped transformers on a single FLOPs axis. We emphasize that the contribution here is the measurement tool; the underlying observation (a specialized model beats a generalist on the specific task it was trained for) is deep-learning textbook material.

All findings the tool has produced

We ran depth-lens on every vendor we could get an API key for, on all bundled tasks — current generation and one generation back to keep the cross-vendor comparison fair. Total spend: ~$14 API + ~30 min local GPU + ~$1 LLM-judge (case study 3).

Use case	Finding	Why it matters
API ops	Opus 4.7 → Haiku 4.5 saves ~$123k/year on a 10k-call/day task	4 concrete tier-downgrade switches, with savings in $
API ops	gpt-5-mini is cheaper per token but 3× slower than o4-mini	Choosing on $/token alone costs UI latency; the Pareto frontier on K-hop tier 4 has 2 points
API ops	Haiku 4.5 collapses on hard 2-SAT at default budget	Constraint-style problems need `budget ≥ 4096`; otherwise the error rate doubles
API ops	Gemini 2.5 Flash uniquely weak vs same-era Anthropic / OpenAI cheap reasoning	3.1 Flash-Lite closes the gap
API ops	Claude Opus 4.7 cost varies 10× across (depth × budget) at fixed accuracy	Maxing the budget is a strict cost loss
API ops	Per-vendor cost-vs-latency plots	One scatter per vendor — Pareto frontier vs budget knobs
Build vs buy	Self-hosted vLLM vs hosted APIs on one Pareto	Llama-3-8B AWQ is cheapest at tier 1; 0% acc at tier 4. DeepSeek-R1-Distill-1.5B hits 0.75 at tier 4. Build-vs-buy as a chart, not a guess
Open-ended	Customer-reply quality via LLM-as-judge (Case 3)	`gpt-5-mini` accuracy decreases with effort on free-form tasks; optimal effort is per-(model, task)
Research / tool	v2.0 — 3 inference-time-compute paradigms on one FLOPs axis	Infrastructure to compare Token-CoT API · Self-hosted vLLM · Looped (OpenMythos 1M/10M/100M) on the same axis. The headline 24,000-410,000× FLOPs ratio is a deep-learning-textbook result; the tool is the contribution
Research	OpenMythos vs Claude head-to-head	Within training distribution, 925K-param looped is ~10,000× faster than Claude at same accuracy. Outside it, API dominates
Research	OpenMythos loops-vs-accuracy saturation	"More loops = deeper reasoning" saturates at `training_max_loop_iters`
Research	OpenMythos extrapolates 1-2 hops past training depth on K-hop	Seed experiment that motivated the project

→ Full v1.0 cross-vendor summary

Status

v0.1 MVP — first end-to-end probe (May 2026)
v0.5 — 4 tasks, 5 adapters, Wilson CIs, cache, Streamlit dashboard
v1.0 — 6 adapter families, 5 tasks, full cross-vendor benchmark (Anthropic/OpenAI/Gemini, current + 2025 prior gen), multi-stage Docker, GitHub Actions CI
v1.1 — OpenMythos head-to-head; cross-paradigm Pareto
v1.2 — self-hosted vLLM with GPU-hour pricing on the same Pareto
v2.0 — 3-paradigm FLOPs measurement tool, dict-lookup task, depth_lens.flops module
v2.1 — LLM-as-judge scorer for open-ended tasks (llm:<judge>:<criterion>), tenant-reply case study
v2.2 — PyPI publish, judge cost folded into recommend $/k-pred, --free-form CLI flag, code-generation task

128 unit tests passing. See ROADMAP.md for what's next.

Install variants

# Quoting the extras keeps shells such as zsh from globbing the brackets.

# API-only (no GPU needed): Anthropic, OpenAI, Gemini, dashboard
pip install -e ".[anthropic,openai,gemini,dashboard]"

# +looped transformer + HuggingFace local probes
pip install -e ".[openmythos,huggingface,anthropic,openai,gemini,dashboard]"

# +self-hosted vLLM (vLLM runs separately via docker compose)
pip install -e ".[anthropic,openai,gemini,dashboard]"   # OpenAI SDK is all that's needed client-side

# Just the framework (BYO adapters)
pip install -e .

Python 3.11+. The bundled OpenMythos training helper assumes CUDA; everything else is happy on CPU or against remote APIs.

Contributing

See CONTRIBUTING.md for how to add a Task or an Adapter (both are ~50 lines + a test) and the conventions used in the bundled implementations.

Citation

@software{depth_lens_2026,
  title  = {depth-lens: Measuring Inference-Time Compute for LLM Production Decisions},
  author = {yutoTachibana},
  year   = {2026},
  url    = {https://github.com/yutoTachibana/depth-lens}
}

License

MIT.


Release history

2.2.0 (this version): May 18, 2026
1.1.0: May 16, 2026
1.0.0: May 16, 2026

