Measure and visualize reasoning depth across model families (looped transformers, extended-thinking APIs, agent loops).
depth-lens
Stop paying for Opus when Haiku does the job — prove it with ~$0.50 of API calls.
Production CI for LLM cost & quality. Sweep a reasoning model's thinking knob on your data and get the cheapest configuration that meets your accuracy bar. Cross-vendor (Anthropic / OpenAI / Gemini / OSS), with Wilson 95% CIs and per-call cost.
This plot is a real depth-lens output. Four Anthropic configurations, all scoring 1.00 accuracy on K-hop tier 4, ranged across ~35× in cost. That's the gap the "use the latest / biggest" instinct burns through silently. depth-lens finds the cheapest passing tier on your data in under 10 minutes — see the 30-second install below.
Two other findings from the same bench: Claude Haiku 4.5 collapses on hard 2-SAT at the default budget but recovers at 4× budget, and Gemini 2.5 Flash was uniquely weak in early 2025 versus same-era cheap reasoning models from Anthropic and OpenAI.
depth-lens is the small OSS tool that finds facts like these:
- Sweep your model's compute knob (`thinking_budget` / `reasoning_effort` / `thinking_level` / `n_loops`) across a depth-controllable task
- Get accuracy curves with Wilson 95% CIs, $/prediction, latency
- Auto-detect overthinking and effective-reasoning-depth ceilings
- Compare across 6 adapter families (Anthropic / OpenAI / Gemini / vLLM / HuggingFace / OpenMythos) on 5 built-in tasks or your own JSONL
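For readers who have not met the Wilson interval before, here is a minimal sketch of the textbook formula. It is an independent illustration, not depth-lens's own code; the function name `wilson_interval` and the default `z = 1.96` are assumptions made for this example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(32, 32)
print(f"32/32 correct -> [{lo:.3f}, {hi:.3f}]")
```

With 32 samples, even a perfect score has a lower bound of about 0.89, which is why the interval is worth reading alongside the point accuracy when the target is 0.95.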
Why this exists
You're running an LLM in production. How do you know you picked the right
model? Anthropic / OpenAI / Google each ship 3 tiers, each with a thinking
knob (thinking_budget, reasoning_effort, thinking_level). That's
9+ default configurations × N models. The instinct is "use the latest /
biggest", and it can leave you overpaying 20× for accuracy you'd get
from a cheap model anyway.
The 5 standard questions production teams ask:
- "Is Opus actually worth 20× the cost of Haiku for my workload?"
- "We're paying $5k/mo on reasoning APIs — where can we cut?"
- "Sonnet 4.7 just dropped — does it break the prompts we tuned for 4.6?"
- "At what `thinking_budget` does accuracy plateau on our task?"
- "My benchmarks say all models score 1.00. Am I missing a real difference?"
depth-lens answers all five on your own data, in a session, for the cost of a sandwich.
What we don't do
- We don't run MMLU or GSM8K — single numbers designed to crown frontier models. Production teams already picked a model family; they need to tune within it.
- We're not LLMThinkBench (HF-only, math-only, single operating point).
- We're not lm-eval-harness (no compute axis).
We sit in the niche that none of the above covers: the cost-vs-quality curve of frontier reasoning APIs on data you actually run in production.
30-second install + recommend the cheapest model
```shell
git clone https://github.com/yutoTachibana/depth-lens.git
cd depth-lens
# Quote the extras so shells like zsh don't treat the brackets as a glob
pip install -e ".[anthropic,openai,gemini]"
export ANTHROPIC_API_KEY=...   # plus OPENAI_API_KEY / GOOGLE_API_KEY as needed

# Your production prompts → one JSONL line each
cat > my_eval.jsonl <<'EOF'
{"prompt": "Compute (47 * 23 + 19) mod 31.", "target": "15", "depth": 1}
{"prompt": "Compute ((11 * 7 - 4) * 3 + 2) mod 41.", "target": "16", "depth": 1}
EOF

# Find the cheapest model that hits 95% accuracy on YOUR data
depth-lens recommend \
  --models anthropic:claude-haiku-4-5,anthropic:claude-sonnet-4-6,anthropic:claude-opus-4-7 \
  --task custom:my_eval.jsonl:first_int \
  --target-accuracy 0.95 \
  --n-samples 32 \
  --daily-calls 10000
```
```text
========================================================================================
Target accuracy ≥ 0.95
Probed 6 configurations, 6 passing.
========================================================================================
✅ Passing (cheapest first):
anthropic:claude-haiku-4-5 d=1 thinking_budget_tokens=4096 acc=1.00 $1.404/k-pred ← cheapest
anthropic:claude-haiku-4-5 d=1 thinking_budget_tokens=1024 acc=1.00 $1.430/k-pred
anthropic:claude-sonnet-4-6 d=1 thinking_budget_tokens=1024 acc=1.00 $2.545/k-pred
anthropic:claude-sonnet-4-6 d=1 thinking_budget_tokens=4096 acc=1.00 $2.863/k-pred
anthropic:claude-opus-4-7 d=1 thinking_budget_tokens=1024 acc=1.00 $3.378/k-pred
anthropic:claude-opus-4-7 d=1 thinking_budget_tokens=4096 acc=1.00 $3.871/k-pred
========================================================================================
At 10,000 calls/day with the cheapest passing config:
anthropic:claude-haiku-4-5 @ thinking_budget_tokens=4096
→ $14.04/day $5,124/year
Switching from anthropic:claude-opus-4-7 @ thinking_budget_tokens=4096 ($38.71/day)
saves $24.67/day = $9,006/year (64% reduction)
```
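That's it. You now have a defensible answer to "is Opus actually worth the premium over Haiku for my workload?", backed by a real sweep with Wilson 95% CIs. In this run the most expensive passing Opus configuration costs about 2.8× the cheapest Haiku one ($3.871 vs $1.404 per 1k predictions).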
The above is real depth-lens output, on a small bench (tier-1 K-hop). Same machinery on a harder task — see docs/findings/v1.0-cost-savings.md — projects up to $123k/year savings for typical production workloads at 10k calls/day.
💡 What you're looking at: depth-lens just ran every (model, budget) combination on your data, scored them against your target, ranked the passers by per-prediction cost, and projected the yearly savings vs the most-expensive passing config. That's the production-CI workflow this tool was built for.
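The rank-and-project step described above can be sketched in a few lines. This is a hypothetical reimplementation for illustration, not the tool's source: the `Config` dataclass and `recommend` function are names invented here, the pass rule (point accuracy at or above the target) is an assumption, and the per-1k costs are copied from the output shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    model: str
    budget: int        # thinking_budget_tokens
    accuracy: float    # point estimate in [0, 1]
    usd_per_k: float   # dollars per 1,000 predictions


def recommend(configs, target_accuracy, daily_calls):
    """Rank passing configs by cost and project savings vs the priciest passer."""
    # Assumption: a config passes when its point accuracy meets the target.
    passing = sorted(
        (c for c in configs if c.accuracy >= target_accuracy),
        key=lambda c: c.usd_per_k,
    )
    if not passing:
        return None
    cheapest, priciest = passing[0], passing[-1]
    daily_cheap = cheapest.usd_per_k / 1000 * daily_calls
    daily_pricey = priciest.usd_per_k / 1000 * daily_calls
    saved = daily_pricey - daily_cheap
    return {
        "cheapest": cheapest,
        "daily_usd": daily_cheap,
        "saved_per_day": saved,
        "saved_per_year": saved * 365,
        "reduction": saved / daily_pricey,
    }


# Per-1k costs taken from the recommend output above.
configs = [
    Config("claude-haiku-4-5", 4096, 1.00, 1.404),
    Config("claude-haiku-4-5", 1024, 1.00, 1.430),
    Config("claude-sonnet-4-6", 1024, 1.00, 2.545),
    Config("claude-sonnet-4-6", 4096, 1.00, 2.863),
    Config("claude-opus-4-7", 1024, 1.00, 3.378),
    Config("claude-opus-4-7", 4096, 1.00, 3.871),
]
report = recommend(configs, target_accuracy=0.95, daily_calls=10_000)
print(report["cheapest"].model, f"${report['daily_usd']:.2f}/day",
      f"saves ${report['saved_per_day']:.2f}/day ({report['reduction']:.0%})")
```

Because the sketch starts from the rounded per-1k figures, its yearly totals can differ from the tool's by a few dollars.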
Real findings the tool has produced
We ran depth-lens on every vendor we could get an API key for, on all 5 bundled tasks — current generation and one generation back to keep the cross-vendor comparison fair. Total spend: ~$14. Time invested: a single session.
| Finding | Why it matters |
|---|---|
| Switching from Opus 4.7 to Haiku 4.5 saves ~$123k/year on a 10k-call/day task | depth-lens surfaces 4 concrete tier-downgrade switches, each with its savings in dollars |
| Haiku 4.5 collapses on hard 2-SAT at default budget | If you use Haiku for constraint-style problems, set budget≥4096 or pay 2× error rate |
| Gemini 2.5 Flash was uniquely weak vs same-era Anthropic / OpenAI cheap reasoning | When we tested 2025-era models from all 3 vendors, Anthropic Sonnet 4 (May 2025) and o3-mini (Jan 2025) were already at ceiling on K-hop. Only Gemini Flash collapsed. 3.1 Flash-Lite closes the gap. |
| Claude Opus 4.7 cost varies 10× across (depth × budget) at fixed accuracy | Maxing the budget is a strict cost loss for many task classes |
| OpenAI gpt-5-mini is cheaper-per-token but 3× slower than o4-mini | Latency-sensitive paths should pick o4-mini |
| OpenMythos (looped transformer) extrapolates 1-2 hops past training depth | Architecture-specific finding from the experiment that motivated the project |
→ See the full v1.0 cross-vendor summary
What's in the box
6 adapter families
| Spec | Compute knob | Cost basis |
|---|---|---|
| `anthropic:<model>` | `thinking_budget_tokens` | API |
| `openai:<model>` | `reasoning_effort` | API |
| `gemini:<model>` | `thinking_budget_tokens` (2.5) / auto-mapped to `thinking_level` (3.x) | API |
| `vllm:<model>` | `reasoning_effort` (OpenAI-compatible local server) | self-hosted |
| `hf:<hf-model-id>` | `max_thinking_tokens` (CoT length) | local GPU |
| `openmythos` | `n_loops` (Recurrent-Depth Transformer) | local GPU |
API adapters fan requests through a thread pool (max_concurrent); a
1000-prompt probe finishes in minutes, not hours.
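As a rough picture of the fan-out, the sketch below pushes prompts through a bounded thread pool from the standard library. It illustrates the general pattern only and is not the adapters' actual code; `fan_out` and `fake_call` are hypothetical names, and `fake_call` stands in for a real API request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompts, call_model, max_concurrent=8):
    """Run call_model over prompts with at most max_concurrent calls in flight.

    pool.map preserves input order, so outputs line up with prompts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(call_model, prompts))


def fake_call(prompt):
    """Placeholder for a network-bound API call."""
    time.sleep(0.05)
    return prompt.upper()


start = time.perf_counter()
outputs = fan_out([f"p{i}" for i in range(16)], fake_call, max_concurrent=8)
elapsed = time.perf_counter() - start
print(outputs[:3], f"{elapsed:.2f}s")
```

Threads suit this workload because each call spends nearly all its time waiting on the network, so a higher `max_concurrent` mostly trades wall-clock time against the vendor's rate limits.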
5 built-in probe tasks
| Task | Depth axis | Reasoning shape |
|---|---|---|
| `k-hop` | K (operators) | Forward composition (mod-arithmetic) |
| `parity` | n (bits) | Aggregation (XOR reduction) |
| `graph-reach` | path length | Single BFS pass |
| `state-tracking` | K (instructions) | Vector state (2-counter register machine) |
| `mini-csp` | n (variables) | Search / constraint propagation (2-SAT) |
| `custom:<jsonl>:<scorer>` | optional `depth` field | Bring your own data |
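To make "depth-controllable" concrete, here is a small sketch of a K-step modular-arithmetic generator that emits records with the `prompt` / `target` / `depth` fields used in the custom JSONL example. It is a hypothetical stand-in, not the bundled `k-hop` task: `make_k_hop`, the prompt wording, and the default modulus are all assumptions.

```python
import json
import random


def make_k_hop(k: int, modulus: int = 31, seed: int = 0) -> dict:
    """Build one record whose answer needs k sequential modular operations."""
    rng = random.Random(seed)  # seeded so the same arguments give the same record
    value = rng.randrange(modulus)
    steps = [f"Start with {value}."]
    for _ in range(k):
        op = rng.choice("+-*")
        operand = rng.randrange(1, modulus)
        if op == "+":
            value = (value + operand) % modulus
            verb = "Add"
        elif op == "-":
            value = (value - operand) % modulus
            verb = "Subtract"
        else:
            value = (value * operand) % modulus
            verb = "Multiply by"
        steps.append(f"{verb} {operand}, then reduce mod {modulus}.")
    prompt = " ".join(steps) + " What is the result?"
    # The target is computed alongside the prompt, so it cannot drift out of sync.
    return {"prompt": prompt, "target": str(value), "depth": k}


# One JSONL line per record, at increasing depth.
for depth in (1, 2, 4):
    print(json.dumps(make_k_hop(depth, seed=depth)))
```

Generating the target in the same loop that builds the prompt avoids hand-computed answers, which are easy to get wrong in a custom eval file.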
Built-in scorers for `custom:` tasks: `exact`, `first_int`, `last_int`, `yes_no`,
`contains`, `regex:<pattern>`. Verbose CoT outputs are parsed for
`Final answer: …` lines automatically.
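The sketch below shows one plausible way such scorers could behave, including the `Final answer:` handling. It is an illustrative guess rather than depth-lens's implementation: the function bodies, the regexes, and the choice to use the last `Final answer:` line are assumptions, and only the scorer names come from the list above.

```python
import re

_INT = re.compile(r"-?\d+")
_FINAL = re.compile(r"final answer:\s*(.+)", re.IGNORECASE)


def extract_answer(output: str) -> str:
    """Return the text after the last 'Final answer:' marker, else the whole output."""
    matches = _FINAL.findall(output)
    return matches[-1].strip() if matches else output.strip()


def first_int(output: str, target: str) -> bool:
    """Correct when the first integer in the answer text equals the target."""
    match = _INT.search(extract_answer(output))
    return match is not None and int(match.group()) == int(target)


def last_int(output: str, target: str) -> bool:
    """Correct when the last integer in the answer text equals the target."""
    found = _INT.findall(extract_answer(output))
    return bool(found) and int(found[-1]) == int(target)


def yes_no(output: str, target: str) -> bool:
    """Correct when the first standalone yes/no in the answer matches the target."""
    match = re.search(r"\b(yes|no)\b", extract_answer(output), re.IGNORECASE)
    return match is not None and match.group(1).lower() == target.strip().lower()


cot = "47 * 23 = 1081, plus 19 is 1100.\nFinal answer: 15"
print(first_int(cot, "15"), first_int("47 * 23 = 1081", "15"))
```

Without the `Final answer:` step, `first_int` would pick up `47` from the working rather than the result, which is the failure mode that automatic parsing of verbose outputs guards against.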
Diagnostics
Every ProbeResult exposes:
- `.accuracy`: `[depth][compute]` grid in `[0, 1]`
- `.ci()`: Wilson 95% intervals on every cell
- `.effective_depth(threshold=0.5)`: biggest depth where some compute level clears the bar
- `.overthinking(depth, tolerance=0.02)`: peak compute is not max compute, by how much
- `.cost_per_cell(pricing)`: $/prediction given a `{input, output, thinking}` USD-per-1M dict
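To show what these diagnostics compute, here is a standalone sketch over a plain list-of-lists accuracy grid. It does not use `ProbeResult`; the free functions, the `>=` threshold rule, and the return conventions are assumptions, and the grid values are made-up illustrative numbers, not measured results.

```python
def effective_depth(depths, accuracy, threshold=0.5):
    """Largest depth at which at least one compute level reaches the threshold."""
    passing = [d for d, row in zip(depths, accuracy) if max(row) >= threshold]
    return max(passing) if passing else None


def overthinking(row, tolerance=0.02):
    """Accuracy given up by using max compute instead of the best level.

    Returns 0.0 when the gap is within tolerance (treated as noise).
    """
    gap = max(row) - row[-1]
    return gap if gap > tolerance else 0.0


def cost_per_prediction(tokens, pricing):
    """USD for one call, given token counts and USD-per-1M-token prices."""
    return sum(count * pricing.get(kind, 0.0) for kind, count in tokens.items()) / 1_000_000


# Toy grid: rows are depths, columns are increasing compute levels.
depths = [3, 5, 7, 9]
accuracy = [
    [1.00, 1.00, 1.00],
    [0.94, 1.00, 1.00],
    [0.56, 0.81, 0.75],
    [0.31, 0.44, 0.38],
]
print(effective_depth(depths, accuracy))
print(round(overthinking(accuracy[2]), 2))
print(cost_per_prediction(
    {"input": 200, "output": 50, "thinking": 1000},
    {"input": 1.0, "output": 5.0, "thinking": 5.0},
))
```

In the toy grid, depth 7 peaks at the middle compute level and drops at the highest one, which is the pattern the overthinking diagnostic is meant to flag.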
CLI
```shell
depth-lens recommend ...   # find cheapest model meeting your accuracy bar (production workflow)
depth-lens probe ...       # detailed sweep of one model
depth-lens compare ...     # overlay several models on the same task
depth-lens dashboard       # Streamlit UI over your cached probes
```
Each subcommand has full --help. See docs/playbook/
for end-to-end production scenarios.
Python API
```python
from depth_lens import probe
from depth_lens.tasks import get_task
from depth_lens.adapters.anthropic_adapter import AnthropicAdapter

task = get_task("mini-csp")
adapter = AnthropicAdapter(model="claude-haiku-4-5", task_name="mini-csp")

# Sweep four depths with 16 samples per (depth, compute) cell
result = probe(adapter, task, depths=[3, 5, 7, 9], n_samples=16)

# Prices in USD per 1M tokens
pricing = {"input": 1.0, "output": 5.0}

print(f"effective depth: {result.effective_depth(0.5)}")
print(f"overthinking @ d=9: {result.overthinking(9)}")
# Index [3, 1] is the 4th depth (d=9) at the 2nd compute level
print(f"$/pred @ d=9 mid budget: {result.cost_per_cell(pricing)[3, 1]}")
```
How it compares to existing tools
| | LLMThinkBench | usail-hkust bench | o1 scaling laws | depth-lens |
|---|---|---|---|---|
| Compute-axis curves (not single point) | ❌ | partial | ✅ (o1 only) | ✅ |
| Cross-vendor (Claude / o-series / Gemini / OSS) | ❌ HF only | partial | ❌ o1 only | ✅ |
| Looped transformer (OpenMythos) | ❌ | ❌ | ❌ | ✅ |
| Bring-your-own JSONL | ❌ | ❌ | ❌ | ✅ |
| Cost per prediction with sweep | ❌ | ❌ | ❌ | ✅ |
| Bounded-depth synthetic probes | ❌ | partial | ❌ | ✅ |
The closest active competitor is LLMThinkBench, which targets math-task overthinking on HuggingFace models at a fixed operating point. That is orthogonal to depth-lens's compute-axis sweep across vendor APIs.
Status
- v0.1 MVP: first end-to-end probe (May 2026)
- v0.5: 4 tasks, 5 adapters, Wilson CIs, cache, Streamlit dashboard
- v1.0: 6 adapter families, 5 tasks, full cross-vendor benchmark (Anthropic/OpenAI/Gemini, current + 2025 prior gen), multi-stage Docker, contributor docs, JA translation, GitHub Actions CI (lint + tests)
- v1.0 release: PyPI publish (you can already `pip install -e .` from source)
73 unit tests passing. See ROADMAP.md for what's next.
Install variants
```shell
# API-only (no GPU needed): Anthropic, OpenAI, Gemini, dashboard
pip install -e ".[anthropic,openai,gemini,dashboard]"

# + looped transformer + HuggingFace local probes
pip install -e ".[openmythos,huggingface,anthropic,openai,gemini,dashboard]"

# Just the framework (BYO adapters)
pip install -e .
```
Python 3.11+. The bundled OpenMythos training helper assumes CUDA; everything else is happy on CPU or against remote APIs.
Contributing
See CONTRIBUTING.md for how to add a Task or an Adapter (both are ~50 lines + a test) and the conventions used in the bundled implementations.
Citation
```bibtex
@software{depth_lens_2026,
  title  = {depth-lens: Measuring Reasoning Depth Across Model Families},
  author = {yutoTachibana},
  year   = {2026},
  url    = {https://github.com/yutoTachibana/depth-lens}
}
```
License
MIT.