Measure and visualize reasoning depth across model families (looped transformers, extended-thinking APIs, agent loops).

depth-lens

Stop paying for Opus when Haiku does the job — prove it with ~$0.50 of API calls.

Production CI for LLM cost & quality. Sweep a reasoning model's thinking knob on your data and get the cheapest configuration that meets your accuracy bar. Cross-vendor (Anthropic / OpenAI / Gemini / OSS), with Wilson 95% CIs and per-call cost.

Japanese version

Switching from Opus 4.7 to Haiku 4.5 saves $123k/year on a 10k-calls/day workload — same accuracy

This plot is real depth-lens output. Four Anthropic configurations, all scoring 1.00 accuracy on K-hop tier 4, span a ~35× range in cost. That is the gap the "use the latest / biggest" instinct silently burns through. depth-lens finds the cheapest passing tier on your data in under 10 minutes; see the 30-second install below.

Two other findings from the same bench: Claude Haiku 4.5 collapses on hard 2-SAT at the default budget but recovers at 4× budget; Gemini 2.5 Flash was uniquely weak in early 2025 compared with same-era cheap reasoning models from Anthropic and OpenAI.

depth-lens is the small OSS tool that finds facts like these:

Sweep your model's compute knob (thinking_budget / reasoning_effort / thinking_level / n_loops) across a depth-controllable task
Get accuracy curves with Wilson 95% CIs, $/prediction, latency
Auto-detect overthinking and effective-reasoning-depth ceilings
Compare across 6 adapter families (Anthropic / OpenAI / Gemini / vLLM / HuggingFace / OpenMythos) on 5 built-in tasks or your own JSONL
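
The Wilson 95% intervals mentioned above are easy to reproduce by hand. The sketch below is a minimal standalone implementation of the standard Wilson score interval; the function name `wilson_interval` is illustrative and is not taken from the depth-lens codebase.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(32, 32)
print(f"32/32 correct: [{lo:.3f}, {hi:.3f}]")
```

With 32 samples, even a perfect 32/32 score has a lower bound of roughly 0.89, which is why a reported accuracy of 1.00 at small n is weaker evidence than it looks.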

Why this exists

You're running an LLM in production. How do you know you picked the right model? Anthropic, OpenAI, and Google each ship 3 tiers, each with a thinking knob (thinking_budget, reasoning_effort, thinking_level). That's 9+ models × N knob settings. The instinct is "use the latest / biggest", and you can end up paying 20× more for accuracy a cheap model would give you anyway.

The 5 standard questions production teams ask:

"Is Opus actually worth 20× the cost of Haiku for my workload?"
"We're paying $5k/mo on reasoning APIs — where can we cut?"
"Sonnet 4.7 just dropped — does it break the prompts we tuned for 4.6?"
"At what thinking_budget does accuracy plateau on our task?"
"My benchmarks say all models score 1.00 — am I missing a real difference?"

depth-lens answers all five on your own data, in a session, for the cost of a sandwich.

What we don't do

We don't run MMLU or GSM8K — single numbers designed to crown frontier models. Production teams already picked a model family; they need to tune within it.
We're not LLMThinkBench (HF-only, math-only, single operating point).
We're not lm-eval-harness (no compute axis).

We sit in the niche that none of the above covers: the cost-vs-quality curve of frontier reasoning APIs on data you actually run in production.

30-second install + recommend the cheapest model

git clone https://github.com/yutoTachibana/depth-lens.git
cd depth-lens
pip install -e ".[anthropic,openai,gemini]"   # quoted so zsh does not glob the brackets

export ANTHROPIC_API_KEY=...     # plus OPENAI_API_KEY / GOOGLE_API_KEY as needed

# Your production prompts → one JSONL line each
cat > my_eval.jsonl <<'EOF'
{"prompt": "Compute (47 * 23 + 19) mod 31.", "target": "15", "depth": 1}
{"prompt": "Compute ((11 * 7 - 4) * 3 + 2) mod 41.", "target": "16", "depth": 1}
EOF

# Find the cheapest model that hits 95% accuracy on YOUR data
depth-lens recommend \
    --models anthropic:claude-haiku-4-5,anthropic:claude-sonnet-4-6,anthropic:claude-opus-4-7 \
    --task custom:my_eval.jsonl:first_int \
    --target-accuracy 0.95 \
    --n-samples 32 \
    --daily-calls 10000

========================================================================================
Target accuracy ≥ 0.95
Probed 6 configurations, 6 passing.
========================================================================================

✅ Passing (cheapest first):
  anthropic:claude-haiku-4-5     d=1  thinking_budget_tokens=4096  acc=1.00  $1.404/k-pred  ← cheapest
  anthropic:claude-haiku-4-5     d=1  thinking_budget_tokens=1024  acc=1.00  $1.430/k-pred
  anthropic:claude-sonnet-4-6    d=1  thinking_budget_tokens=1024  acc=1.00  $2.545/k-pred
  anthropic:claude-sonnet-4-6    d=1  thinking_budget_tokens=4096  acc=1.00  $2.863/k-pred
  anthropic:claude-opus-4-7      d=1  thinking_budget_tokens=1024  acc=1.00  $3.378/k-pred
  anthropic:claude-opus-4-7      d=1  thinking_budget_tokens=4096  acc=1.00  $3.871/k-pred

========================================================================================
At 10,000 calls/day with the cheapest passing config:
  anthropic:claude-haiku-4-5 @ thinking_budget_tokens=4096
  → $14.04/day  $5,124/year

  Switching from anthropic:claude-opus-4-7 @ thinking_budget_tokens=4096 ($38.71/day)
  saves $24.67/day = $9,006/year (64% reduction)

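That's it. You now have a defensible answer to "Is Opus actually worth the premium over Haiku for my workload?", backed by a real sweep with Wilson 95% CIs. In this run Opus costs roughly 2.4× to 2.8× as much per prediction as the cheapest Haiku configuration, at the same accuracy.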

The output above is real depth-lens output from a small bench (tier-1 K-hop). The same machinery on a harder task (see docs/findings/v1.0-cost-savings.md) projects savings of up to $123k/year for typical production workloads at 10k calls/day.

💡 What you're looking at: depth-lens just ran every (model, budget) combination on your data, scored them against your target, ranked the passers by per-prediction cost, and projected the yearly savings vs the most-expensive passing config. That's the production-CI workflow this tool was built for.
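
The ranking and projection steps described above can be sketched in a few lines. This is an illustration of the arithmetic only, not the tool's actual implementation: the `Config` class and the `recommend` and `daily_cost` functions are hypothetical names, and the rows are copied from the sample output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    model: str
    budget: int
    accuracy: float
    usd_per_k_pred: float  # cost per 1,000 predictions


# Rows copied from the sample output above.
RESULTS = [
    Config("claude-haiku-4-5", 4096, 1.00, 1.404),
    Config("claude-haiku-4-5", 1024, 1.00, 1.430),
    Config("claude-sonnet-4-6", 1024, 1.00, 2.545),
    Config("claude-sonnet-4-6", 4096, 1.00, 2.863),
    Config("claude-opus-4-7", 1024, 1.00, 3.378),
    Config("claude-opus-4-7", 4096, 1.00, 3.871),
]


def daily_cost(cfg: Config, daily_calls: int) -> float:
    return cfg.usd_per_k_pred * daily_calls / 1000


def recommend(results, target_accuracy: float, daily_calls: int):
    """Return (cheapest passing config, savings summary), or None if nothing passes."""
    passing = sorted(
        (r for r in results if r.accuracy >= target_accuracy),
        key=lambda r: r.usd_per_k_pred,
    )
    if not passing:
        return None
    cheapest, priciest = passing[0], passing[-1]
    saved = daily_cost(priciest, daily_calls) - daily_cost(cheapest, daily_calls)
    summary = {
        "usd_per_day": daily_cost(cheapest, daily_calls),
        "usd_per_year": daily_cost(cheapest, daily_calls) * 365,
        "saved_per_day": saved,
        "reduction": saved / daily_cost(priciest, daily_calls),
    }
    return cheapest, summary


best, summary = recommend(RESULTS, target_accuracy=0.95, daily_calls=10_000)
print(best.model, best.budget,
      f"${summary['usd_per_day']:.2f}/day",
      f"{summary['reduction']:.0%} reduction")
```

Run on these rows, the sketch reproduces the $14.04/day and 64% reduction figures from the sample output. Yearly figures can differ by a few dollars from the tool's, because the per-1,000-prediction costs shown here are already rounded.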

Real findings the tool has produced

We ran depth-lens on every vendor we could get an API key for, on all 5 bundled tasks — current generation and one generation back to keep the cross-vendor comparison fair. Total spend: ~$14. Time invested: a single session.

Finding	Why it matters
Switching from Opus 4.7 to Haiku 4.5 saves ~$123k/year on a 10k-call/day task	The 4 concrete "tier-downgrade" savings switches depth-lens surfaces, in $
Haiku 4.5 collapses on hard 2-SAT at default budget	If you use Haiku for constraint-style problems, set `budget≥4096` or pay 2× error rate
Gemini 2.5 Flash was uniquely weak vs same-era Anthropic / OpenAI cheap reasoning	When we tested 2025-era models from all 3 vendors, Anthropic Sonnet 4 (May 2025) and o3-mini (Jan 2025) were already at ceiling on K-hop. Only Gemini Flash collapsed. 3.1 Flash-Lite closes the gap.
Claude Opus 4.7 cost varies 10× across (depth × budget) at fixed accuracy	Maxing the budget is a strict cost loss for many task classes
OpenAI gpt-5-mini is cheaper-per-token but 3× slower than o4-mini	Latency-sensitive paths should pick o4-mini
OpenMythos (looped transformer) extrapolates 1-2 hops past training depth	Architecture-specific finding from the experiment that motivated the project

→ See the full v1.0 cross-vendor summary

What's in the box

6 adapter families

Spec	Compute knob	Cost basis
`anthropic:<model>`	`thinking_budget_tokens`	API
`openai:<model>`	`reasoning_effort`	API
`gemini:<model>`	`thinking_budget_tokens` (2.5) / auto-mapped to `thinking_level` (3.x)	API
`vllm:<model>`	`reasoning_effort` (OpenAI-compatible local server)	self-hosted
`hf:<hf-model-id>`	`max_thinking_tokens` (CoT length)	local GPU
`openmythos`	`n_loops` (Recurrent-Depth Transformer)	local GPU

API adapters fan requests out across a thread pool (max_concurrent), so a 1000-prompt probe finishes in minutes, not hours.
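
For readers writing their own adapter, the fan-out pattern looks roughly like the sketch below. It uses only the standard library; `fan_out` and `fake_call` are hypothetical names, and `fake_call` stands in for a blocking vendor SDK request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompts, call_model, max_concurrent: int = 8):
    """Run call_model over prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which request finishes first.
        return list(pool.map(call_model, prompts))


def fake_call(prompt: str) -> str:
    # Stand-in for a blocking API request (an assumption for this sketch).
    time.sleep(0.05)
    return prompt.upper()


outputs = fan_out(["a", "b", "c", "d"], fake_call, max_concurrent=4)
print(outputs)
```

Threads are a reasonable fit here because the work is I/O-bound: each worker spends most of its time waiting on the network, so the GIL is not the bottleneck.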

5 built-in probe tasks

Task	Depth axis	Reasoning shape
`k-hop`	K (operators)	Forward composition (mod-arithmetic)
`parity`	n (bits)	Aggregation (XOR reduction)
`graph-reach`	path length	Single BFS pass
`state-tracking`	K (instructions)	Vector state (2-counter register machine)
`mini-csp`	n (variables)	Search / constraint propagation (2-SAT)
`custom:<jsonl>:<scorer>`	optional `depth` field	Bring your own data
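
To give custom data its own depth axis, you can generate records whose targets are computed rather than typed by hand. The sketch below builds K-step modular-arithmetic prompts in the spirit of the `k-hop` row; it is not the bundled task generator, and `make_khop_record` is a hypothetical helper. Only the `prompt`, `target`, and `depth` field names come from the JSONL example earlier in this README.

```python
import json
import random


def make_khop_record(k: int, modulus: int = 31, seed: int = 0) -> dict:
    """Build one record with k chained operations; the target is computed, not typed."""
    rng = random.Random(seed)
    value = rng.randrange(modulus)
    steps = [f"Start with {value}."]
    for _ in range(k):
        op = rng.choice(["add", "multiply by"])
        operand = rng.randrange(1, modulus)
        if op == "add":
            value = (value + operand) % modulus
        else:
            value = (value * operand) % modulus
        steps.append(f"Then {op} {operand}, mod {modulus}.")
    prompt = " ".join(steps) + " What is the result?"
    return {"prompt": prompt, "target": str(value), "depth": k}


# One JSONL line per depth, ready to redirect into a file.
for depth in (1, 2, 3):
    print(json.dumps(make_khop_record(depth, seed=depth)))
```

Computing the target in code avoids hand-arithmetic slips in the eval file, which would otherwise show up as unexplained accuracy loss across every model.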

Built-in scorers for custom: tasks are exact, first_int, last_int, yes_no, contains, and regex:<pattern>. Verbose CoT outputs are parsed for "Final answer: …" lines automatically.
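
As a rough mental model of how integer scorers like these might behave, here is a minimal sketch. The function names mirror the scorer names above, but the bodies are assumptions and are not the bundled implementations; in particular, the handling of "Final answer:" lines is a guess at reasonable behavior.

```python
import re

_INT = re.compile(r"-?\d+")
_FINAL = re.compile(r"final answer:\s*(.+)", re.IGNORECASE)


def extract_answer(text: str) -> str:
    """Prefer the last 'Final answer: ...' line if present, else the full text."""
    matches = _FINAL.findall(text)
    return matches[-1].strip() if matches else text.strip()


def first_int(output: str, target: str) -> bool:
    """True if the first integer in the answer equals the target."""
    found = _INT.search(extract_answer(output))
    return found is not None and int(found.group()) == int(target)


def last_int(output: str, target: str) -> bool:
    """True if the last integer in the answer equals the target."""
    found = _INT.findall(extract_answer(output))
    return bool(found) and int(found[-1]) == int(target)


verbose = "47 * 23 = 1081, plus 19 is 1100.\nFinal answer: 15"
print(first_int(verbose, "15"), last_int("It is 15 after 3 steps", "15"))
```

The second call shows why scorer choice matters: an answer that trails off with another number passes `first_int` but fails `last_int`.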

Diagnostics

Every ProbeResult exposes:

.accuracy: [depth][compute] grid of accuracies in [0, 1]
.ci(): Wilson 95% intervals for every cell
.effective_depth(threshold=0.5): the largest depth at which at least one compute level clears the threshold
.overthinking(depth, tolerance=0.02): whether peak accuracy occurs below the maximum compute level, and by how much
.cost_per_cell(pricing): $/prediction given a {input, output, thinking} dict of USD per 1M tokens
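
To make the two depth diagnostics concrete, the sketch below computes them on a plain nested dict. The accuracy numbers are invented for illustration, and the functions are a reading of the descriptions above rather than the ProbeResult implementation. Whether the real threshold comparison is inclusive, and what exactly `overthinking` returns, are assumptions.

```python
# accuracy[depth][compute]; illustrative numbers only, not measured results.
ACCURACY = {
    3: {1024: 1.00, 4096: 1.00, 16384: 0.97},
    5: {1024: 0.81, 4096: 0.94, 16384: 0.94},
    7: {1024: 0.44, 4096: 0.62, 16384: 0.56},
    9: {1024: 0.12, 4096: 0.31, 16384: 0.38},
}


def effective_depth(grid, threshold: float = 0.5):
    """Largest depth where at least one compute level reaches the threshold."""
    passing = [d for d, row in grid.items() if max(row.values()) >= threshold]
    return max(passing) if passing else None


def overthinking(grid, depth: int, tolerance: float = 0.02) -> float:
    """Accuracy lost by using max compute instead of the best level (0.0 if within tolerance)."""
    row = grid[depth]
    max_compute = max(row)
    # On ties, prefer the smallest compute level that reaches the peak accuracy.
    best_compute = max(row, key=lambda c: (row[c], -c))
    gap = row[best_compute] - row[max_compute]
    return gap if gap > tolerance else 0.0


print(effective_depth(ACCURACY), round(overthinking(ACCURACY, 7), 2))
```

On this toy grid the effective depth is 7, and at depth 7 the largest budget scores 0.06 below the mid budget, which is the overthinking pattern the diagnostic is meant to surface.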

CLI

depth-lens recommend ... # find cheapest model meeting your accuracy bar (production workflow)
depth-lens probe ...     # detailed sweep of one model
depth-lens compare ...   # overlay several models on the same task
depth-lens dashboard     # Streamlit UI over your cached probes

Each subcommand has full --help. See docs/playbook/ for end-to-end production scenarios.

Python API

from depth_lens import probe
from depth_lens.tasks import get_task
from depth_lens.adapters.anthropic_adapter import AnthropicAdapter

# Sweep one model on the bundled 2-SAT task at four depths.
task = get_task("mini-csp")
adapter = AnthropicAdapter(model="claude-haiku-4-5", task_name="mini-csp")
result = probe(adapter, task, depths=[3, 5, 7, 9], n_samples=16)

print(f"effective depth: {result.effective_depth(0.5)}")
print(f"overthinking @ d=9: {result.overthinking(9)}")
# Index [3, 1]: depths[3] is d=9, and column 1 is the second compute level.
# The dict is the USD-per-1M pricing described under Diagnostics.
print(f"$/pred @ d=9 mid budget: {result.cost_per_cell({'input': 1.0, 'output': 5.0})[3, 1]}")

How it compares to existing tools

	LLMThinkBench	usail-hkust bench	o1 scaling laws	depth-lens
Compute-axis curves (not single point)	❌	partial	✅ (o1 only)	✅
Cross-vendor (Claude / o-series / Gemini / OSS)	❌ HF only	partial	❌ o1 only	✅
Looped transformer (OpenMythos)	❌	❌	❌	✅
Bring-your-own JSONL	❌	❌	❌	✅
Cost per prediction with sweep	❌	❌	❌	✅
Bounded-depth synthetic probes	❌	partial	❌	✅

The closest active competitor is LLMThinkBench, which targets math-task overthinking on HuggingFace models at a fixed operating point. That is orthogonal to depth-lens's compute-axis sweep across vendor APIs.

Status

v0.1 MVP — first end-to-end probe (May 2026)
v0.5 — 4 tasks, 5 adapters, Wilson CIs, cache, Streamlit dashboard
v1.0 — 6 adapter families, 5 tasks, full cross-vendor benchmark (Anthropic/OpenAI/Gemini, current + 2025 prior gen), multi-stage Docker, contributor docs, JA translation, GitHub Actions CI (lint + tests)
v1.0 release — PyPI publish (you can already pip install -e . from source)

73 unit tests passing. See ROADMAP.md for what's next.

Install variants

# API-only (no GPU needed) — Anthropic, OpenAI, Gemini, dashboard
pip install -e ".[anthropic,openai,gemini,dashboard]"

# +looped transformer + HuggingFace local probes
pip install -e ".[openmythos,huggingface,anthropic,openai,gemini,dashboard]"

# Just the framework (BYO adapters)
pip install -e .

Python 3.11+. The bundled OpenMythos training helper assumes CUDA; everything else is happy on CPU or against remote APIs.

Contributing

See CONTRIBUTING.md for how to add a Task or an Adapter (both are ~50 lines + a test) and the conventions used in the bundled implementations.

Citation

@software{depth_lens_2026,
  title  = {depth-lens: Measuring Reasoning Depth Across Model Families},
  author = {yutoTachibana},
  year   = {2026},
  url    = {https://github.com/yutoTachibana/depth-lens}
}

License

MIT.

Release history

2.2.0 (May 18, 2026)
1.1.0 (May 16, 2026)
1.0.0 (May 16, 2026, this version)

Download files

Source Distribution

depth_lens-1.0.0.tar.gz (1.3 MB)

Uploaded May 16, 2026 Source

Built Distribution

depth_lens-1.0.0-py3-none-any.whl (58.6 kB)

Uploaded May 16, 2026 Python 3

File details

Details for the file depth_lens-1.0.0.tar.gz.

File metadata

Download URL: depth_lens-1.0.0.tar.gz
Upload date: May 16, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.10

File hashes

Hashes for depth_lens-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`e2000de9110c9997bf4edee3ef135f1fae9600713bfff9977acb23a0987ee933`
MD5	`ad71a833da4dd3f6a079d80b3d678870`
BLAKE2b-256	`fd6b5efb9bbee51fe16f19d013e65a55905ed2cbc23730dab0fb54ca89dc9022`

File details

Details for the file depth_lens-1.0.0-py3-none-any.whl.

File metadata

Download URL: depth_lens-1.0.0-py3-none-any.whl
Upload date: May 16, 2026
Size: 58.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.10

File hashes

Hashes for depth_lens-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7e1df6e324a3e70b89553f11d2fc0343b1e9612de7daa3805adf0e09b4d06789`
MD5	`d5a1cfcaa4df0f1d6d517f01708995e4`
BLAKE2b-256	`c6d33792e1aca99d1799ccfb409ee664c21c4ac707695395dff6e65e5868a071`
