
Measure and visualize reasoning depth across model families (looped transformers, extended-thinking APIs, agent loops).


depth-lens

Stop paying for Opus when Haiku does the job — prove it with ~$0.50 of API calls.

Production CI for LLM cost & quality. Sweep a reasoning model's thinking knob on your data and get the cheapest configuration that meets your accuracy bar. Cross-vendor (Anthropic / OpenAI / Gemini / OSS), with Wilson 95% CIs and per-call cost.

Japanese version

License: MIT · Python 3.11+ · Status: v1.0 alpha

Switching from Opus 4.7 to Haiku 4.5 saves $123k/year on a 10k-calls/day workload — same accuracy

This plot is a real depth-lens output. Four Anthropic configurations, all scoring 1.00 accuracy on K-hop tier 4, span a ~35× range in cost. That's the gap the "use the latest / biggest" instinct burns through silently. depth-lens finds the cheapest passing tier on your data in under 10 minutes; see the 30-second install below.

Two other findings from the same bench: Claude Haiku 4.5 collapses on hard 2-SAT at the default budget but recovers at 4× budget, and Gemini 2.5 Flash was uniquely weak in early 2025 compared with same-era cheap reasoning models from Anthropic and OpenAI.

depth-lens is the small OSS tool that finds facts like these:

  • Sweep your model's compute knob (thinking_budget / reasoning_effort / thinking_level / n_loops) across a depth-controllable task
  • Get accuracy curves with Wilson 95% CIs, $/prediction, latency
  • Auto-detect overthinking and effective-reasoning-depth ceilings
  • Compare across 6 adapter families (Anthropic / OpenAI / Gemini / vLLM / HuggingFace / OpenMythos) on 5 built-in tasks or your own JSONL
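The Wilson 95% intervals mentioned in the list above can be sketched in a few lines. This is the standard Wilson score formula, not code taken from depth-lens; the function name `wilson_interval` is an illustrative assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(32, 32)
print(f"32/32 correct: 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With 32 samples, a perfect 32/32 score still has a lower bound of about 0.89, which is why the interval matters when reading a cell that shows acc=1.00.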

Why this exists

You're running an LLM in production. How do you know you picked the right model? Anthropic, OpenAI, and Google each ship 3 tiers, and each tier has a thinking knob (thinking_budget, reasoning_effort, thinking_level). That's 9+ models × N knob settings. The instinct is "use the latest / biggest", and you can end up overpaying 20× for accuracy you'd get from a cheap model anyway.

The 5 standard questions production teams ask:

  1. "Is Opus actually worth 20× the cost of Haiku for my workload?"
  2. "We're paying $5k/mo on reasoning APIs — where can we cut?"
  3. "Sonnet 4.7 just dropped — does it break the prompts we tuned for 4.6?"
  4. "At what thinking_budget does accuracy plateau on our task?"
  5. "My benchmarks say all models score 1.00 — am I missing a real difference?"

depth-lens answers all five on your own data, in a session, for the cost of a sandwich.

What we don't do

  • We don't run MMLU or GSM8K — single numbers designed to crown frontier models. Production teams already picked a model family; they need to tune within it.
  • We're not LLMThinkBench (HF-only, math-only, single operating point).
  • We're not lm-eval-harness (no compute axis).

We sit in the niche that none of the above covers: the cost-vs-quality curve of frontier reasoning APIs on data you actually run in production.

30-second install + recommend the cheapest model

git clone https://github.com/yutoTachibana/depth-lens.git
cd depth-lens
pip install -e .[anthropic,openai,gemini]

export ANTHROPIC_API_KEY=...     # plus OPENAI_API_KEY / GOOGLE_API_KEY as needed

# Your production prompts → one JSONL line each
cat > my_eval.jsonl <<'EOF'
{"prompt": "Compute (47 * 23 + 19) mod 31.", "target": "15", "depth": 1}
{"prompt": "Compute ((11 * 7 - 4) * 3 + 2) mod 41.", "target": "16", "depth": 1}
EOF

# Find the cheapest model that hits 95% accuracy on YOUR data
depth-lens recommend \
    --models anthropic:claude-haiku-4-5,anthropic:claude-sonnet-4-6,anthropic:claude-opus-4-7 \
    --task custom:my_eval.jsonl:first_int \
    --target-accuracy 0.95 \
    --n-samples 32 \
    --daily-calls 10000
========================================================================================
Target accuracy ≥ 0.95
Probed 6 configurations, 6 passing.
========================================================================================

✅ Passing (cheapest first):
  anthropic:claude-haiku-4-5     d=1  thinking_budget_tokens=4096  acc=1.00  $1.404/k-pred  ← cheapest
  anthropic:claude-haiku-4-5     d=1  thinking_budget_tokens=1024  acc=1.00  $1.430/k-pred
  anthropic:claude-sonnet-4-6    d=1  thinking_budget_tokens=1024  acc=1.00  $2.545/k-pred
  anthropic:claude-sonnet-4-6    d=1  thinking_budget_tokens=4096  acc=1.00  $2.863/k-pred
  anthropic:claude-opus-4-7      d=1  thinking_budget_tokens=1024  acc=1.00  $3.378/k-pred
  anthropic:claude-opus-4-7      d=1  thinking_budget_tokens=4096  acc=1.00  $3.871/k-pred

========================================================================================
At 10,000 calls/day with the cheapest passing config:
  anthropic:claude-haiku-4-5 @ thinking_budget_tokens=4096
  → $14.04/day  $5,124/year

  Switching from anthropic:claude-opus-4-7 @ thinking_budget_tokens=4096 ($38.71/day)
  saves $24.67/day = $9,006/year (64% reduction)

That's it. You now have a defensible answer to "is Opus actually worth ~2.8× the cost of Haiku for my workload?", backed by a real sweep with Wilson 95% CIs.

The above is real depth-lens output, on a small bench (tier-1 K-hop). Same machinery on a harder task — see docs/findings/v1.0-cost-savings.md — projects up to $123k/year savings for typical production workloads at 10k calls/day.

💡 What you're looking at: depth-lens just ran every (model, budget) combination on your data, scored them against your target, ranked the passers by per-prediction cost, and projected the yearly savings vs the most-expensive passing config. That's the production-CI workflow this tool was built for.
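The filter, rank, and project steps described above can be sketched as follows. The `ConfigResult` class and helper names are hypothetical, not the depth-lens API; the six rows are copied from the sample output, with costs as rounded there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigResult:
    model: str
    budget: int
    accuracy: float
    usd_per_k_pred: float  # dollars per 1,000 predictions


def cheapest_passing(results, target_accuracy):
    """Keep configs that meet the accuracy bar, cheapest first."""
    passing = [r for r in results if r.accuracy >= target_accuracy]
    return sorted(passing, key=lambda r: r.usd_per_k_pred)


def daily_cost(result, daily_calls):
    return result.usd_per_k_pred * daily_calls / 1000


results = [
    ConfigResult("anthropic:claude-opus-4-7", 4096, 1.00, 3.871),
    ConfigResult("anthropic:claude-haiku-4-5", 1024, 1.00, 1.430),
    ConfigResult("anthropic:claude-sonnet-4-6", 4096, 1.00, 2.863),
    ConfigResult("anthropic:claude-haiku-4-5", 4096, 1.00, 1.404),
    ConfigResult("anthropic:claude-opus-4-7", 1024, 1.00, 3.378),
    ConfigResult("anthropic:claude-sonnet-4-6", 1024, 1.00, 2.545),
]

ranked = cheapest_passing(results, target_accuracy=0.95)
best, worst = ranked[0], ranked[-1]
best_day = daily_cost(best, 10_000)
worst_day = daily_cost(worst, 10_000)
saving = worst_day - best_day
print(f"cheapest: {best.model} @ {best.budget} -> ${best_day:.2f}/day")
print(f"saves ${saving:.2f}/day ({100 * saving / worst_day:.0f}% reduction)")
```

Yearly figures computed from these rounded inputs may differ by a few dollars from the tool's own projection.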

Real findings the tool has produced

We ran depth-lens on every vendor we could get an API key for, on all 5 bundled tasks — current generation and one generation back to keep the cross-vendor comparison fair. Total spend: ~$14. Time invested: a single session.

| Finding | Why it matters |
|---|---|
| Switching from Opus 4.7 to Haiku 4.5 saves ~$123k/year on a 10k-call/day task | The 4 concrete "tier-downgrade" savings switches depth-lens surfaces, in $ |
| Haiku 4.5 collapses on hard 2-SAT at default budget | If you use Haiku for constraint-style problems, set budget≥4096 or pay 2× error rate |
| Gemini 2.5 Flash was uniquely weak vs same-era Anthropic / OpenAI cheap reasoning | When we tested 2025-era models from all 3 vendors, Anthropic Sonnet 4 (May 2025) and o3-mini (Jan 2025) were already at ceiling on K-hop. Only Gemini Flash collapsed. 3.1 Flash-Lite closes the gap. |
| Claude Opus 4.7 cost varies 10× across (depth × budget) at fixed accuracy | Maxing the budget is a strict cost loss for many task classes |
| OpenAI gpt-5-mini is cheaper-per-token but 3× slower than o4-mini | Latency-sensitive paths should pick o4-mini |
| OpenMythos (looped transformer) extrapolates 1-2 hops past training depth | Architecture-specific finding from the experiment that motivated the project |

→ See the full v1.0 cross-vendor summary

What's in the box

6 adapter families

| Spec | Compute knob | Cost basis |
|---|---|---|
| anthropic:<model> | thinking_budget_tokens | API |
| openai:<model> | reasoning_effort | API |
| gemini:<model> | thinking_budget_tokens (2.5) / auto-mapped to thinking_level (3.x) | API |
| vllm:<model> | reasoning_effort (OpenAI-compatible local server) | self-hosted |
| hf:<hf-model-id> | max_thinking_tokens (CoT length) | local GPU |
| openmythos | n_loops (Recurrent-Depth Transformer) | local GPU |

API adapters fan requests through a thread pool (max_concurrent); a 1000-prompt probe finishes in minutes, not hours.
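The fan-out pattern can be sketched with the standard library. This is a minimal illustration, not the adapters' actual code; `call_model` is a hypothetical stand-in for a blocking API request, and only the `max_concurrent` name comes from the text above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a blocking API call."""
    time.sleep(0.01)  # simulate network latency
    return f"echo: {prompt}"


def run_batch(prompts, max_concurrent=8):
    """Send prompts through a bounded thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(call_model, prompts))


prompts = [f"prompt {i}" for i in range(20)]
outputs = run_batch(prompts)
print(len(outputs), outputs[0])
```

Threads suit this workload because each call spends nearly all its time waiting on the network, so the pool size mostly controls how many requests are in flight against the vendor's rate limit.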

5 built-in probe tasks

| Task | Depth axis | Reasoning shape |
|---|---|---|
| k-hop | K (operators) | Forward composition (mod-arithmetic) |
| parity | n (bits) | Aggregation (XOR reduction) |
| graph-reach | path length | Single BFS pass |
| state-tracking | K (instructions) | Vector state (2-counter register machine) |
| mini-csp | n (variables) | Search / constraint propagation (2-SAT) |
| custom:<jsonl>:<scorer> | optional depth field | Bring your own data |
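For the custom: row, one way to build a JSONL file with reliable targets is to compute them rather than type them by hand. The sketch below generates modular-arithmetic items in the prompt / target / depth shape used by the quickstart; the generator itself and its prompt wording are assumptions, not the bundled k-hop task.

```python
import json
import random


def make_khop_item(k: int, modulus: int, rng: random.Random) -> dict:
    """Chain k random operations onto a start value, then reduce mod `modulus`."""
    value = rng.randint(2, 50)
    expr = str(value)
    for _ in range(k):
        op = rng.choice("+-*")
        operand = rng.randint(2, 30)
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
        expr = f"({expr} {op} {operand})"
    # Python's % returns a non-negative result for a positive modulus.
    return {
        "prompt": f"Compute {expr} mod {modulus}.",
        "target": str(value % modulus),
        "depth": k,
    }


rng = random.Random(0)  # fixed seed so the file is reproducible
items = [make_khop_item(k, 31, rng) for k in (1, 2, 3)]
jsonl = "\n".join(json.dumps(item) for item in items)
print(jsonl)
```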

Built-in scorers for custom:: exact, first_int, last_int, yes_no, contains, regex:<pattern>. Verbose CoT outputs are parsed for Final answer: … lines automatically.
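To make the scorer names concrete, here is one plausible reading of first_int and last_int combined with the Final answer: convention. The real scorers may differ in their details; treat every function below as a hypothetical sketch.

```python
import re

_INT = re.compile(r"-?\d+")
_FINAL = re.compile(r"final answer:\s*(.+)", re.IGNORECASE)


def answer_span(output: str) -> str:
    """Text after the last 'Final answer:' marker, else the whole output."""
    matches = _FINAL.findall(output)
    return matches[-1].strip() if matches else output.strip()


def first_int(output: str, target: str) -> bool:
    """True if the first integer in the answer span equals the target."""
    found = _INT.search(answer_span(output))
    return found is not None and int(found.group()) == int(target)


def last_int(output: str, target: str) -> bool:
    """True if the last integer in the answer span equals the target."""
    found = _INT.findall(answer_span(output))
    return bool(found) and int(found[-1]) == int(target)


verbose = "47 * 23 = 1081, plus 19 is 1100.\nFinal answer: 15"
terse = "47 * 23 = 1081, so 15"
print(first_int(verbose, "15"), first_int(terse, "15"), last_int(terse, "15"))
```

The terse case shows why the choice of scorer matters: without a Final answer: line, first_int picks up an operand from the working, while last_int picks up the result.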

Diagnostics

Every ProbeResult exposes:

  • .accuracy[depth][compute]: grid of accuracies in [0, 1]
  • .ci(): Wilson 95% intervals on every cell
  • .effective_depth(threshold=0.5): largest depth at which some compute level clears the bar
  • .overthinking(depth, tolerance=0.02): whether accuracy peaks below the maximum compute level, and by how much
  • .cost_per_cell(pricing): $/prediction given a {input, output, thinking} USD-per-1M-token dict
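The two depth diagnostics can be illustrated on a plain nested dict. This sketch assumes that "clears the bar" means accuracy at or above the threshold and that overthinking is reported as an accuracy gap. Both are guesses about semantics, and the grid values are made up for illustration.

```python
def effective_depth(accuracy, threshold=0.5):
    """Largest depth where at least one compute level reaches the threshold."""
    passing = [
        depth
        for depth, by_compute in accuracy.items()
        if max(by_compute.values()) >= threshold
    ]
    return max(passing) if passing else None


def overthinking(accuracy, depth, tolerance=0.02):
    """Accuracy lost at max compute relative to the peak, or 0.0 if negligible."""
    by_compute = accuracy[depth]
    max_compute = max(by_compute)  # largest budget tested
    peak_compute = max(by_compute, key=by_compute.get)  # best-scoring budget
    gap = by_compute[peak_compute] - by_compute[max_compute]
    return gap if peak_compute != max_compute and gap > tolerance else 0.0


# Made-up grid: accuracy[depth][thinking budget]
accuracy = {
    3: {1024: 0.94, 4096: 1.00},
    5: {1024: 0.62, 4096: 0.81},
    7: {1024: 0.56, 4096: 0.44},
    9: {1024: 0.12, 4096: 0.19},
}

print(effective_depth(accuracy, 0.5))
print(overthinking(accuracy, 7))
print(overthinking(accuracy, 3))
```

In this toy grid, depth 7 is the last depth where any budget reaches 0.5, and it is also a depth where the larger budget scores worse than the smaller one.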

CLI

depth-lens recommend ... # find cheapest model meeting your accuracy bar (production workflow)
depth-lens probe ...     # detailed sweep of one model
depth-lens compare ...   # overlay several models on the same task
depth-lens dashboard     # Streamlit UI over your cached probes

Each subcommand has full --help. See docs/playbook/ for end-to-end production scenarios.

Python API

from depth_lens import probe
from depth_lens.tasks import get_task
from depth_lens.adapters.anthropic_adapter import AnthropicAdapter

# Sweep Haiku 4.5 on the bundled 2-SAT task at four depths, 16 samples per cell.
task = get_task("mini-csp")
adapter = AnthropicAdapter(model="claude-haiku-4-5", task_name="mini-csp")
result = probe(adapter, task, depths=[3, 5, 7, 9], n_samples=16)

# Pricing is USD per 1M tokens.
pricing = {"input": 1.0, "output": 5.0}

print(f"effective depth: {result.effective_depth(0.5)}")
print(f"overthinking @ d=9: {result.overthinking(9)}")
# Index [3, 1] selects depth index 3 (d=9) and compute index 1 (mid budget).
print(f"$/pred @ d=9 mid budget: {result.cost_per_cell(pricing)[3, 1]}")
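The pricing dict is expressed in USD per 1M tokens, so a per-prediction cost presumably reduces to a weighted sum over token counts. The sketch below shows that arithmetic under stated assumptions: the helper name, the token counts, and the separate thinking rate are all illustrative, not taken from depth-lens.

```python
def usd_per_prediction(tokens: dict, pricing: dict) -> float:
    """Sum tokens * rate over each token kind; rates are USD per 1M tokens."""
    return sum(
        tokens.get(kind, 0) * usd_per_million / 1_000_000
        for kind, usd_per_million in pricing.items()
    )


# Hypothetical averages for one prediction.
tokens = {"input": 200, "output": 50, "thinking": 1000}
pricing = {"input": 1.0, "output": 5.0, "thinking": 5.0}

cost = usd_per_prediction(tokens, pricing)
print(f"${cost:.5f} per prediction, ${1000 * cost:.2f} per 1k predictions")
```

With these numbers the thinking tokens account for most of the cost, which is the usual reason a larger budget raises $/prediction even when the visible answer stays short.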

How it compares to existing tools

LLMThinkBench usail-hkust bench o1 scaling laws depth-lens
Compute-axis curves (not single point) partial ✅ (o1 only)
Cross-vendor (Claude / o-series / Gemini / OSS) ❌ HF only partial ❌ o1 only
Looped transformer (OpenMythos)
Bring-your-own JSONL
Cost per prediction with sweep
Bounded-depth synthetic probes partial

The closest active competitor is LLMThinkBench, which targets math-task overthinking on HuggingFace models at a fixed operating point. That is orthogonal to depth-lens's compute-axis sweep across vendor APIs.

Status

  • v0.1 MVP — first end-to-end probe (May 2026)
  • v0.5 — 4 tasks, 5 adapters, Wilson CIs, cache, Streamlit dashboard
  • v1.0 — 6 adapter families, 5 tasks, full cross-vendor benchmark (Anthropic/OpenAI/Gemini, current + 2025 prior gen), multi-stage Docker, contributor docs, JA translation, GitHub Actions CI (lint + tests)
  • v1.0 release — PyPI publish (you can already pip install -e . from source)

73 unit tests passing. See ROADMAP.md for what's next.

Install variants

# API-only (no GPU needed) — Anthropic, OpenAI, Gemini, dashboard
pip install -e .[anthropic,openai,gemini,dashboard]

# +looped transformer + HuggingFace local probes
pip install -e .[openmythos,huggingface,anthropic,openai,gemini,dashboard]

# Just the framework (BYO adapters)
pip install -e .

Python 3.11+. The bundled OpenMythos training helper assumes CUDA; everything else is happy on CPU or against remote APIs.

Contributing

See CONTRIBUTING.md for how to add a Task or an Adapter (both are ~50 lines + a test) and the conventions used in the bundled implementations.

Citation

@software{depth_lens_2026,
  title  = {depth-lens: Measuring Reasoning Depth Across Model Families},
  author = {yutoTachibana},
  year   = {2026},
  url    = {https://github.com/yutoTachibana/depth-lens}
}

License

MIT.
