Forecast agent failure from execution traces — spend retries and verification only where needed.
trace_use
trace_use is a self-contained Python toolkit that monitors LLM agents in real time, intercepts bugs mid-turn with deterministic probe tests, and learns from accumulated failures to predict where the next one will land.
It wraps around any tool-use agent — a single line of code — and provides two complementary layers:
| Layer | When it runs | What it does |
|---|---|---|
| brain.py (Online Brain) | During execution, after every tool call | Runs deterministic probes on live code, injects targeted corrective feedback mid-turn |
| pipeline.py (Forecaster) | After task completion | Embeds traces, stores with pass/fail labels, predicts P(fail) via kNN for retry decisions |
The key insight: the trace carries the failure signal
How an agent reasons predicts failure independently of whether the final answer looks wrong. Reasoning-only AUC on structured multi-hop tasks reaches 0.84 — wrong reasoning paths diverge from correct ones in embedding space well before the final answer token.
This means failure can be detected mid-generation, not just retrospectively. The signal transfers to task types never seen before (leave-one-out AUC 0.61–0.73). One-liner responses have near-zero signal; multi-step reasoning — tool traces, chain-of-thought — is what makes it work.
| Agent type | Signal quality | Why |
|---|---|---|
| Tool-use agent (python_exec, search) | High (AUC 0.87) | Tool call sequences differ structurally; correct traces show clean execution, failing traces show wrong output or repeated attempts |
| Text agent with CoT | Moderate (AUC 0.68) | Wrong reasoning produces wrong intermediate values; a forced step-by-step output creates discriminating structure |
| One-liner text agent | Near chance | "Paris" and "Lyon" produce near-identical embeddings |
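The AUC figures above can be read as the probability that a randomly chosen failing trace gets a higher predicted P(fail) than a randomly chosen passing one. The sketch below computes that quantity directly in plain Python. The name `failure_auc` and the sample scores are illustrative assumptions, not the package's own `auc` primitive from forecast.py.

```python
def failure_auc(p_fail, failed):
    """Probability that a failing trace outranks a passing one (ties count 0.5)."""
    fails = [p for p, f in zip(p_fail, failed) if f]
    passes = [p for p, f in zip(p_fail, failed) if not f]
    if not fails or not passes:
        raise ValueError("need at least one pass and one fail")
    wins = 0.0
    for pf in fails:
        for pp in passes:
            if pf > pp:
                wins += 1.0      # failing trace correctly ranked above passing trace
            elif pf == pp:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(fails) * len(passes))


# Made-up scores: two failing traces, three passing traces.
scores = [0.9, 0.7, 0.4, 0.2, 0.1]
labels = [True, False, True, False, False]  # True = the trace failed
print(failure_auc(scores, labels))
```

An AUC of 0.5 means the scores carry no ranking information, which is the "near chance" case in the table.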
Practical rule: force multi-step output. A CoT wrapper adds signal to any text-only agent:
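from trace_use import haiku

def cot_agent(prompt: str):
    """Wrap a text-only agent so it shows its reasoning before the final answer."""
    return haiku(
        prompt + "\n\nThink step by step, showing every intermediate result. "
        "End with 'ANSWER: ...'."
    )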
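Because the wrapper asks the model to end with an 'ANSWER: ...' line, the final answer can be split from the reasoning that precedes it. The helper below is a minimal sketch of that split. The name `extract_answer` is hypothetical and not part of trace_use.

```python
import re


def extract_answer(cot_output: str):
    """Return the text after the last 'ANSWER:' marker, or None if it is absent."""
    matches = re.findall(r"ANSWER:\s*(.+)", cot_output)
    return matches[-1].strip() if matches else None


sample = "Step 1: 12 * 3 = 36\nStep 2: 36 + 4 = 40\nANSWER: 40"
print(extract_answer(sample))   # the final answer line
print(extract_answer("Paris"))  # a one-liner has no marker
```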
The Brain (brain.py)
BrainAgent attaches to a tool-use agent via a single hook called after every tool execution. It provides three independent failure signals:
Signal 1 — Probe tests (deterministic, fires immediately)
For each task, you register a probe_fn(ns: dict) -> list[str]. After the agent's first python_exec call, the brain re-runs the code in an isolated namespace and calls the probe. If the probe returns failures, it immediately injects specific corrective feedback into the tool result before the next LLM turn:
STOP — your code fails these tests RIGHT NOW:
✗ rolling_vol([0.01, 0.01, ..., 0.01], window=10) = 0.0043
Expected ≈ 0 for a constant-return series.
FIX: Use returns.rolling(window).std() — not rolling().mean().std().
The latter computes vol-of-averages, not rolling volatility.
Fix the specific issue above then call python_exec again immediately.
The agent reads this as the tool result and corrects the bug in the next turn. No re-prompting, no retry from scratch.
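The probe step described above can be pictured as "execute the code in a fresh namespace, call the probe, and turn any failures into a message". The sketch below shows that flow in plain Python. `run_probe`, `format_feedback`, and the toy `double` task are hypothetical names for illustration; how BrainAgent actually isolates execution is not shown in this document.

```python
def run_probe(code: str, probe_fn):
    """Execute agent code in a fresh namespace, then ask the probe for failures."""
    namespace = {}
    try:
        # NOTE: exec is not a security sandbox; this assumes trusted or contained code.
        exec(code, namespace)
    except Exception as exc:
        return [f"code raised {type(exc).__name__}: {exc}"]
    return probe_fn(namespace)


def format_feedback(failures):
    """Build the corrective message for the tool result (None if the code is clean)."""
    if not failures:
        return None
    lines = ["STOP: your code fails these tests RIGHT NOW:"]
    lines += [f"  ✗ {msg}" for msg in failures]
    lines.append("Fix the specific issue above then call python_exec again immediately.")
    return "\n".join(lines)


def probe_double(ns):
    """Toy probe: double(3) must equal 6."""
    fn = ns.get("double")
    if fn is None:
        return ["double(x) not defined"]
    got = fn(3)
    return [] if got == 6 else [f"double(3) = {got}, expected 6. FIX: return x * 2."]


buggy = "def double(x):\n    return x + 2\n"
fixed = "def double(x):\n    return x * 2\n"
print(format_feedback(run_probe(buggy, probe_double)))
print(format_feedback(run_probe(fixed, probe_double)))
```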
Signal 2 — kNN over stored code snippets (learned)
A FailureStore keeps code-snippet embeddings with pass/fail labels. Every python_exec input is embedded and compared against stored snippets at query time. When P(fail) exceeds threshold, similar failed snippets surface as context — the agent can see which patterns caused failures before.
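As a rough illustration of the kNN estimate, the sketch below ranks stored embeddings by cosine similarity and reports the failing fraction among the k nearest. The two-dimensional vectors and the name `knn_p_fail` are assumptions made for the example. Real embeddings have hundreds of dimensions, and the FailureStore internals may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_p_fail(query, store, k=5):
    """store: list of (embedding, label) pairs, with label 1 = pass and 0 = fail."""
    if not store:
        return 0.0
    ranked = sorted(store, key=lambda item: cosine(query, item[0]), reverse=True)
    nearest = ranked[:k]
    return sum(1 for _, label in nearest if label == 0) / len(nearest)


store = [
    ([1.0, 0.0], 0),
    ([0.9, 0.1], 0),
    ([0.8, 0.3], 1),
    ([0.0, 1.0], 1),
    ([0.1, 0.9], 1),
]
# The query sits next to the two stored failures, so two of its three neighbours failed.
print(knn_p_fail([1.0, 0.05], store, k=3))
```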
Signal 3 — Trajectory prefix kNN + Markov chain (learned)
A TrajectoryStore stores completed runs as ordered sequences of chunk embeddings:
- Prefix kNN: The mean embedding of the live trajectory's prefix is compared to the same-length prefix of each stored run. kNN fraction of failing runs → P(fail). Works from the first stored example.
- Markov state failure rate: Once ≥ 30 chunks are stored, k-means discretizes all chunk embeddings into thought-state clusters. Each cluster tracks what fraction of runs visiting it eventually failed. Current chunk → nearest cluster → P(fail | state). Captures: "models that reason this way tend to get the wrong answer."
The two learned signals combine: p_fail = 0.55 × p_markov + 0.45 × p_prefix.
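The per-state failure rate and the weighted combination can be sketched in a few lines. In this illustration, string labels stand in for k-means cluster ids, and `state_fail_rates` and `combine` are hypothetical helper names. Only the 0.55 / 0.45 weights come from the formula above.

```python
def state_fail_rates(runs):
    """runs: list of (state_sequence, failed).

    Returns {state: fraction of the runs visiting that state which failed}.
    """
    visits, fails = {}, {}
    for states, failed in runs:
        for s in set(states):  # count each run once per state it visits
            visits[s] = visits.get(s, 0) + 1
            fails[s] = fails.get(s, 0) + (1 if failed else 0)
    return {s: fails[s] / visits[s] for s in visits}


def combine(p_markov, p_prefix):
    """Weighted blend of the two learned signals."""
    return 0.55 * p_markov + 0.45 * p_prefix


runs = [
    (["plan", "code", "test"], False),
    (["plan", "guess", "code"], True),
    (["guess", "guess", "code"], True),
    (["plan", "code", "fix", "test"], False),
]
rates = state_fail_rates(runs)
print(rates["guess"], rates["code"])            # every "guess" run failed; half of "code" runs did
print(round(combine(rates["guess"], 0.40), 3))  # 0.55 * 1.0 + 0.45 * 0.40
```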
Wiring it up
from trace_use import BrainAgent, build_embedder, extract_code, tool_agent

embedder = build_embedder()  # local sentence-transformers, free
brain = BrainAgent(embedder, threshold=0.30, k=5)
agent = tool_agent(["python_exec"], max_turns=8, model="claude-haiku-4-5-20251001")
agent.monitor = brain  # single line to attach the brain

def probe_fn(ns: dict) -> list[str]:
    """Return empty list if code is correct; return error strings to trigger intervention."""
    import numpy as np, pandas as pd
    fn = ns.get("compute_returns")
    if not fn:
        return ["compute_returns(prices_df) not defined"]
    prices = pd.DataFrame({"A": [100.0, 110.0]})
    r = fn(prices)
    if r is None:
        # Handle None separately so the message below never indexes into None.
        return ["compute_returns returned None, expected a DataFrame of log returns"]
    got = float(r.iloc[0, 0])
    if abs(got - np.log(1.1)) > 0.02:
        return [
            f"compute_returns gives {got:.4f} for 100→110, "
            f"expected log return {np.log(1.1):.4f}. "
            "FIX: Use np.log(prices / prices.shift(1)).dropna() not arithmetic returns."
        ]
    return []

# tasks: your list of {"prompt": ...} dicts; run_checks: your own pass/fail function
for i, task in enumerate(tasks):
    brain.set_task(i, probe_fn=probe_fn)  # register probe (or None for kNN-only)
    brain.reset()
    trace, tokens = agent(task["prompt"])
    code = extract_code(trace)
    passed = run_checks(code)
    # IMPORTANT: always store the FIRST-attempt trace with the FIRST-attempt label.
    # Never store retry traces; they conflate recovery patterns with failure patterns.
    brain.store(trace, int(passed))
    if code:
        brain.store_code(code, int(passed))
The brain fires at most 2 times per task to avoid flooding the agent with warnings. Probe tests fire on first-attempt bugs; kNN fires later in the run once enough failures accumulate.
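The per-task cap can be thought of as a small counter that is reset between tasks. The class below is a hypothetical sketch of that idea. `InterventionBudget` is not a trace_use class, and the real bookkeeping may differ.

```python
class InterventionBudget:
    """Hypothetical per-task cap on corrective messages."""

    def __init__(self, max_fires=2):
        self.max_fires = max_fires
        self.fires = 0

    def reset(self):
        """Call before each new task."""
        self.fires = 0

    def try_fire(self, feedback):
        """Return feedback while budget remains, else None (tool result left unchanged)."""
        if feedback is None or self.fires >= self.max_fires:
            return None
        self.fires += 1
        return feedback


budget = InterventionBudget()
results = [budget.try_fire("FIX: ...") for _ in range(4)]
print(results)  # the first two calls deliver feedback, later ones are suppressed
```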
BrainAgent public API
| Method | Description |
|---|---|
| brain.set_task(idx, probe_fn=fn) | Register the current task index and optional deterministic probe |
| brain.reset() | Clear buffer and intervention counter before a new task |
| brain.on_tool_call(name, input_dict, result) | Hook called by tool_agent on every tool execution; returns modified result or None |
| brain.store(trace, label, metadata="") | Store a completed run's full trajectory (label: 1=pass, 0=fail) |
| brain.store_code(code, label, metadata="") | Store a code snippet with its label |
| brain.n_stored | Total completed runs in the trajectory store |
| brain._code_interventions | How many times the brain fired on the current task |
Storage invariant
Always store the first-attempt trace with the first-attempt label — even when a retry fires and recovers a failed component. Storing retry traces conflates recovery patterns with failure patterns and degrades the kNN signal.
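The invariant is easy to get wrong when a retry succeeds, so a toy version may help. In this sketch a plain list stands in for the brain's store, and `record_first_attempt` is a hypothetical helper. The point is only that the retry outcome is reported but never stored.

```python
store = []  # stand-in for the trajectory store: (trace, label) pairs, 1 = pass, 0 = fail


def record_first_attempt(first_trace, first_passed, retry_trace=None, retry_passed=None):
    """Store only the first attempt; return the final outcome for reporting."""
    store.append((first_trace, int(first_passed)))
    if retry_trace is None:
        return bool(first_passed)
    return bool(retry_passed)  # the retry decides the reported result, not the stored label


final = record_first_attempt("trace A (buggy)", False, "trace A' (fixed)", True)
print(final)  # the task is reported as passed
print(store)  # but the store keeps the failing first attempt with label 0
```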
The Forecaster (pipeline.py)
Forecaster operates after task completion. It embeds full traces, stores them with labels, and predicts P(fail) for new traces via kNN. It integrates with run_task for end-to-end orchestration.
Quickstart
from trace_use import haiku, opus, build_embedder, run_task, self_judge, Forecaster

embedder = build_embedder()
forecaster = Forecaster(embedder)
verifier = self_judge(judge_agent=opus)  # use a different model; self-grading is overconfident

result = run_task(
    task="Explain the CAP theorem and name all three properties.",
    agent=haiku,
    verifier=verifier,
    forecaster=forecaster,
    retry=True,
)
print(result.summary())
With a tool-use agent
from trace_use import tool_agent, build_embedder, run_task, code_judge, Forecaster

agent = tool_agent(["python_exec"], max_turns=6)
fc = Forecaster(build_embedder())

def check(namespace: dict, stdout: str) -> bool:
    fn = namespace.get("binary_search")
    # Compare against None explicitly so the function always returns a bool.
    return fn is not None and fn([1, 3, 5, 7, 9], 5) == 2 and fn([1, 3, 5, 7, 9], 9) == 4

result = run_task(
    task="Fix the off-by-one in this binary search: ...",
    agent=agent,
    verifier=code_judge(check),
    forecaster=fc,
    retry=True,
)
Forecaster API
| Method / property | Description |
|---|---|
| fc.fit(traces, labels) | Bulk-load trace strings and int labels |
| fc.add(trace, label) | Add one trace online after a task completes |
| fc.predict_fail(trace) | float in [0,1]: P(this trace fails) |
| fc.should_intervene(trace) | bool, uses adaptive threshold |
| fc.explain(trace, k=3) | Nearest stored traces with similarity, label, and excerpt |
| fc.adaptive_threshold | Auto-computed: fail_rate + (1 − fail_rate) × 0.20 |
Cold-start: predictions become reliable at approximately 50 traces with a mix of passes and failures. Before that, predict_fail returns 0.0.
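A worked example of the adaptive threshold formula shows how it moves with the base failure rate: the threshold always sits 20% of the way from the observed fail rate to 1. The function below is a sketch, not the Forecaster's code. The 1 = pass, 0 = fail convention is borrowed from the BrainAgent API, and the empty-store default is an assumption.

```python
def adaptive_threshold(labels):
    """labels: 1 = pass, 0 = fail. Threshold = fail_rate + (1 - fail_rate) * 0.20."""
    if not labels:
        return 0.20  # assumption: with no data, fail_rate is treated as 0
    fail_rate = labels.count(0) / len(labels)
    return fail_rate + (1 - fail_rate) * 0.20


for fails in (10, 30, 50):
    labels = [0] * fails + [1] * (100 - fails)
    print(f"fail rate {fails}% -> threshold {adaptive_threshold(labels):.2f}")
```

With a 10% fail rate the threshold is 0.28. At 30% it rises to 0.44, and at 50% to 0.60, so a store with many failures needs a stronger signal before it triggers an intervention.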
Results
Summary across all evaluations
| Eval | Model | Tasks | Baseline | +Brain | Brain contribution |
|---|---|---|---|---|---|
| Multi-hop QA (FanOutQA + MuSiQue) | Haiku | component | n/a | AUC 0.85 | n/a |
| Python debugging (demo_debug.py) | Haiku | 29 | n/a | AUC 0.87 | n/a |
| Diverse everyday tasks (demo_general.py) | Haiku | 40 | n/a | AUC 0.68 | n/a |
| 30 diverse domains (eval_fires) | Haiku | 30 | 27/30 (90%) | 28/30 (93%) | +1 task, 5 fires |
| Hard code + text (eval_hard) | Sonnet | 14 | 12/14 (86%) | 13/14 (93%) | +1 task, 1 fire |
| 30-task intensive (eval_haiku_intensive) | Haiku | 30 | 26/30 (87%) | 27/30 (90%) | +1 task, 2 fires |
| Real-world hard tasks (eval_real_world) | Haiku | 30 | 28/30 (93%) | 29/30 (97%) | +1 task, 2 fires |
| Extensive benchmark (eval_extensive) | Haiku | 32 | 28/32 (88%) | 28/32 (88%) | 0 tasks, 5 fires |
| Portfolio Risk Analyzer (eval_project) | Haiku | 15 | 13/15 (87%) | 14/15 (93%) | +1 task, 4 fires |
Multi-hop QA — per-component forecasting
Decomposing tasks into atomic sub-questions and forecasting each independently raised AUC from ~0.45 (chance, whole-task labels) to 0.85 on structured multi-hop QA (FanOutQA + MuSiQue).
| Metric | Value |
|---|---|
| Per-component failure AUC | 0.85 |
| Reasoning-only AUC (no answer text) | 0.84 |
| Failures caught at 20% verify budget | 31% (1.56× random baseline) |
| Budget to catch 80% of failures | 58–68% of components |
| Leave-one-task-type-out AUC | 0.61–0.73 (zero-shot transfer) |
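The "failures caught at a verify budget" metric can be reproduced on any scored set: rank components by P(fail), verify only the top share, and count how many of all failures fall inside it. The sketch below uses made-up scores, and `caught_at_budget` is a hypothetical name. Random selection would catch a share of failures equal to the budget, so dividing the result by the budget gives the lift over random.

```python
def caught_at_budget(p_fail, failed, budget=0.20):
    """Fraction of all failures found when verifying the top `budget` share by P(fail)."""
    n_check = int(round(budget * len(p_fail)))
    ranked = sorted(zip(p_fail, failed), key=lambda t: t[0], reverse=True)
    caught = sum(1 for _, f in ranked[:n_check] if f)
    total = sum(1 for f in failed if f)
    return caught / total if total else 0.0


# Made-up scores for ten components, four of which failed.
p_fail = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
failed = [True, False, True, False, False, True, False, False, False, True]

recall = caught_at_budget(p_fail, failed, budget=0.20)
print(recall)         # share of failures caught by verifying the top 20%
print(recall / 0.20)  # lift over random selection at the same budget
```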
Hard one-shot failures — Sonnet + Brain (eval/eval_hard.py)
14 tasks where Sonnet reliably fails in one shot: 7 hard algorithm tasks (LRU cache, sliding window max, histogram largest rectangle, regex matching, thread-safe bank, burst balloons, Trie) and 7 physics/probability text problems (Bayesian base-rate neglect, rolling sphere inertia, twin paradox, hydrogen emission, buoyancy paradox, Simpson's paradox, Bertrand box).
| | Baseline | +Brain |
|---|---|---|
| Code tasks (7) | 6/7 | 7/7 |
| Text tasks (7) | 6/7 | 6/7 |
| Overall | 12/14 (86%) | 13/14 (93%) |
Brain fixed the histogram (largest rectangle) task — Sonnet's first implementation used a naive O(n²) approach that produced wrong results on edge cases. The probe caught it in one fire.
Real-world hard tasks — 30 tasks (eval/eval_real_world.py)
Tasks drawn from confirmed LLM failure modes in competitive programming and GPQA Diamond research: segment tree with lazy propagation, KMP with overlapping matches, LIS O(n log n), Bellman-Ford with negative cycle detection, Graham scan convex hull, matrix chain multiplication, sliding window median, Manacher's palindrome — and 15 graduate-level science and combinatorics problems (Nernst equation, Compton scattering, de Broglie wavelength, Henderson-Hasselbalch, Michaelis-Menten, CRT, Stirling numbers, derangements).
| | Baseline | +Brain |
|---|---|---|
| Code (15 tasks) | 13/15 (87%) | 14/15 (93%) |
| Text (15 tasks) | 15/15 (100%) | 15/15 (100%) |
| Overall | 28/30 (93%) | 29/30 (97%) |
The brain fixed the Graham scan convex hull task: Haiku's first implementation failed the probe's edge-case tests (collinear point handling and interior point exclusion). The probe fired twice, and Haiku corrected both issues in subsequent turns.
Haiku solved 15/15 GPQA-style text problems correctly on first attempt — Henderson-Hasselbalch, Bragg's law, Nernst equation, Compton scattering, CRT, derangements, Catalan and Stirling numbers all passed without brain intervention.
Extensive hard-task benchmark — 32 tasks (eval/eval_extensive.py)
32 tasks drawn from competitive programming (LiveCodeBench Pro / ICPC-Eval difficulty) and GPQA-style science: lazy-propagation segment tree, bitmask TSP, matrix exponentiation, digit DP, Manacher's, minimum window substring, lexicographic topological sort, Kruskal's MST, plus Python debugging traps and 12 physics/math problems.
| | Baseline | +Brain |
|---|---|---|
| Code (20 tasks) | 17/20 (85%) | 17/20 (85%) |
| Text (12 tasks) | 11/12 (92%) | 11/12 (92%) |
| Overall | 28/32 (88%) | 28/32 (88%) |
Brain fired on 5 tasks (LCS substring, topological sort, segment tree, Kruskal's MST, late-binding closure); none were fixed. This is the clearest illustration of the brain's ceiling: when a task fails because the entire algorithm is wrong — not because of a specific edge-case bug — probe feedback can't recover it. The brain's value is highest when errors are localized (a formula sign, a boundary condition, a missed edge case), not when the approach itself needs rethinking.
The 4 failures are all genuinely hard for Haiku: lazy-propagation segment tree, Kruskal's MST with Union-Find path compression, Python late-binding closure semantics, and the particle-in-a-box energy formula.
Day-in-the-life project eval — Portfolio Risk Analyzer (eval/eval_project.py)
The most realistic test: 15 sequential tasks that together build a complete stock portfolio risk analyzer from scratch, as a data analyst would in a single working session. Each task builds on the previous — bugs in early tasks propagate downstream.
Tasks (in order):
| # | Task | First attempt | +Brain |
|---|---|---|---|
| 1 | Simulate correlated stock prices (GBM + Cholesky) | ✓ | ✓ ⚡×1 |
| 2 | Compute log daily returns | ✓ | ✓ |
| 3 | Rolling 20-day statistics (mean, vol, skew) | ✗ | ✓ ⚡×1 ↑FIXED |
| 4 | Annualised covariance matrix | ✓ | ✓ |
| 5 | Minimum variance portfolio (scipy.optimize) | ✓ | ✓ |
| 6 | Maximum Sharpe ratio (tangency portfolio) | ✓ | ✓ |
| 7 | 1-day 95% Value at Risk (historical) | ✓ | ✓ |
| 8 | Conditional VaR / Expected Shortfall | ✓ | ✓ |
| 9 | Maximum drawdown | ✓ | ✓ |
| 10 | Annualised Sharpe ratio | ✓ | ✓ |
| 11 | Portfolio beta to market | ✓ | ✓ |
| 12 | Risk contribution (marginal to portfolio variance) | ✓ | ✓ |
| 13 | Stress test: apply shock scenarios | ✓ | ✓ ⚡×2 |
| 14 | Monthly rebalancing with transaction costs | ✗ | ✗ ⚡×2 |
| 15 | Full portfolio risk report | ✓ | ✓ |
Overall: 13/15 (87%) baseline → 14/15 (93%) with brain
What the brain caught (Task 3 — Rolling statistics):
Haiku's first implementation computed returns.rolling(window).mean().std() — the standard deviation of rolling averages — instead of returns.rolling(window).std(), the rolling standard deviation. These are not the same: the first smooths out variation before measuring it, systematically underestimating volatility.
The probe detected this with a constant-return test series. A constant series ([0.01, 0.01, ..., 0.01]) has zero rolling().std(), but nonzero rolling().mean().std() — so a wrong implementation passes on typical data but fails here. The brain injected:
STOP — your code fails these tests RIGHT NOW:
✗ rolling vol of constant-return series = 0.0043, expected ≈ 0.
FIX: Use returns.rolling(window).std()
NOT returns.rolling(window).mean().std()
The latter gives vol-of-averages, not rolling volatility.
Haiku corrected it in the next turn. Without this catch at Task 3, the covariance matrix (Task 4), Sharpe ratio (Task 10), and the final risk report (Task 15) would all have been built on wrong volatility estimates. Early interception prevents silent error propagation — the core benefit in a project context.
Use it in your own projects
Install
pip install trace-use
Or install from source (for the latest or to run evals):
git clone <this-repo>
cd Trace-Optimization
pip install -e .
Set your API key — either export it or drop a .env file at your project root:
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-... # only needed if sentence-transformers is unavailable
Then import and go:
from trace_use import BrainAgent, build_embedder, tool_agent
from trace_use import Forecaster, run_task, self_judge, code_judge
Verify the offline test suite at any time (no API key needed):
pytest tests/ -q # 153 tests, ~1s, fully stubbed
Minimal setup — wrap any task loop in 5 minutes
No probes, no custom verifiers — just the brain's kNN trajectory signal. The brain starts cold and fires warnings as failures accumulate:
from trace_use import BrainAgent, build_embedder, tool_agent

brain = BrainAgent(build_embedder(), threshold=0.30, k=5)
agent = tool_agent(["python_exec"], max_turns=8, model="claude-haiku-4-5-20251001")
agent.monitor = brain  # one line to attach

for i, (prompt, check_fn) in enumerate(my_tasks):
    brain.set_task(i)  # no probe_fn = kNN-only mode
    brain.reset()
    trace, tokens = agent(prompt)
    passed = check_fn(trace)  # your existing evaluation logic
    brain.store(trace, int(passed))  # brain learns from this outcome
The brain's kNN signal becomes meaningful around task 15–20 once it has seen a mix of passes and failures. Probe tests (below) work immediately from task 1.
Add probe tests for maximum benefit
Probes are deterministic unit tests that run on the agent's code before the next LLM turn. They are the highest-value part of the brain and work from task 1 with no warm-up.
A good probe:
- Tests a specific edge case that the model commonly gets wrong
- Returns an empty list if the code is correct (never fire on correct code)
- Includes a FIX: clause in the failure message pointing to the exact algorithm error
def probe_sharpe(ns: dict) -> list[str]:
    """Catch the most common Sharpe ratio bug: missing sqrt(252) annualisation."""
    fn = ns.get("sharpe_ratio") or ns.get("annualized_sharpe")
    if not fn:
        return ["sharpe_ratio(returns, rf=0.02) not defined"]
    import numpy as np, pandas as pd
    np.random.seed(0)
    # Known distribution: daily mu=0.0005, vol=0.01 → annualised Sharpe ≈ 0.79
    rets = pd.Series(np.random.normal(0.0005, 0.01, 252))
    sr = fn(rets, rf=0.0)
    if not isinstance(sr, (int, float)):
        return ["sharpe_ratio must return a scalar float"]
    if abs(sr) < 0.3:
        return [
            f"sharpe_ratio = {sr:.4f}: this looks like a daily (non-annualised) figure. "
            "FIX: Multiply by sqrt(periods): "
            "sr = (returns.mean() - rf/periods) / returns.std() * sqrt(periods) "
            "For daily data, sqrt(252) ≈ 15.87."
        ]
    return []

# register per task
brain.set_task(task_idx, probe_fn=probe_sharpe)
Probe design tips:
- Test the specific algorithm property most likely to be wrong, not the whole function
- Use a synthetic input where the correct answer is analytically known
- For numerical functions: compare to a closed-form value, not another implementation
- Make the FIX: clause algorithmic, not vague: "use np.log(p/p.shift(1))" not "check your formula"
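The tips above can be applied to a second, dependency-free case. The probe below targets a hypothetical `max_drawdown(prices)` function and uses a four-point series whose answer is known exactly. The function name, the negative-fraction sign convention, and the two toy implementations are assumptions made for the example.

```python
def probe_max_drawdown(ns: dict) -> list[str]:
    """Probe for a hypothetical max_drawdown(prices) that returns a negative fraction."""
    fn = ns.get("max_drawdown")
    if fn is None:
        return ["max_drawdown(prices) not defined"]
    # Peak 120 followed by trough 90: drawdown = 90 / 120 - 1 = -0.25, known exactly.
    got = fn([100.0, 120.0, 90.0, 110.0])
    if abs(got - (-0.25)) > 1e-9:
        return [
            f"max_drawdown([100, 120, 90, 110]) = {got:.4f}, expected -0.2500. "
            "FIX: measure each price against the running peak (price / peak - 1) "
            "and take the minimum; do not measure against the first price."
        ]
    return []


def good(prices):
    """Correct: compare every price with the highest price seen so far."""
    peak, worst = prices[0], 0.0
    for p in prices:
        peak = max(peak, p)
        worst = min(worst, p / peak - 1)
    return worst


def bad(prices):
    """Buggy: measures the drop from the first price instead of the running peak."""
    return min(prices) / prices[0] - 1


print(probe_max_drawdown({"max_drawdown": good}))  # empty list: no intervention
print(probe_max_drawdown({"max_drawdown": bad}))   # one failure message with a FIX: clause
```

The series is chosen so the peak is not the first price. That is the property that separates the correct implementation from the common mistake.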
Track what the brain is doing
# See how often the brain fires
print(f"Brain interventions this task: {brain._code_interventions}")
print(f"Total stored trajectories: {brain.n_stored}")

# After your loop, print a summary
for r in results:
    fires = r.get("fires", 0)
    # .get() avoids a KeyError when a result dict has no "brain_helped" entry
    status = "FIXED" if r.get("brain_helped") else ("FIRE" if fires else "")
    print(f"[{'✓' if r['passed'] else '✗'}] {status:5} {r['name']}")
The visualization dashboard updates live during a run:
from eval.viz_brain import BrainViz
from pathlib import Path
viz = BrainViz()
# inside your task loop, after each task:
viz.update(brain, results, fire_counts)
viz.save(Path("my_session.png"))
This produces a 4-panel dark dashboard showing: the Markov thought-state graph (nodes sized by visit count, colored by failure rate), a PCA trajectory map of all stored runs, a per-task pass/fail timeline, and the brain fire rate.
Threshold tuning
The default threshold is 0.30. Lower it if you want earlier, more aggressive intervention; raise it if the brain fires too often on passing tasks.
brain = BrainAgent(embedder, threshold=0.25) # more aggressive
brain = BrainAgent(embedder, threshold=0.40) # more conservative
For a new use case, start at 0.30 and observe brain._code_interventions across 10–20 tasks. If the brain fires on tasks that pass without intervention, raise the threshold. If it never fires on tasks that fail, lower it.
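One way to make this tuning loop concrete is to count, for a few candidate thresholds, how often the brain would have fired on a passing task and how often it would have stayed silent on a failing one. The sketch assumes you have logged a P(fail) estimate and the final outcome per task. The `records` data and the name `fire_stats` are made up for illustration.

```python
def fire_stats(records, threshold):
    """records: list of (p_fail, failed). Returns (false_fires, missed_failures)."""
    false_fires = sum(1 for p, failed in records if p > threshold and not failed)
    missed = sum(1 for p, failed in records if p <= threshold and failed)
    return false_fires, missed


# Made-up log of eight tasks: (estimated P(fail), whether the task actually failed).
records = [
    (0.10, False), (0.28, False), (0.35, False), (0.33, True),
    (0.45, True), (0.20, True), (0.15, False), (0.60, True),
]
for t in (0.25, 0.30, 0.40):
    print(t, fire_stats(records, t))
```

On this toy log, lowering the threshold to 0.25 adds a false fire without catching any extra failure, while raising it to 0.40 removes the false fires but misses one more failure.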
Choosing what tasks to add probes for
Not every task needs a probe. Use probes where:
- There is a specific known failure mode (e.g., a particular edge case, formula direction, or off-by-one)
- The failure is deterministically testable with a small synthetic input
- Getting it wrong silently corrupts downstream tasks
Skip probes for:
- Tasks with ambiguous success criteria
- Tasks where any reasonable implementation is acceptable
- Text-generation tasks (probes only work on python_exec)
A realistic day-of-work pattern
This mirrors how the portfolio analyzer eval was run. Each task is a sequential step in a larger project; the brain accumulates signal across all of them:
from trace_use import BrainAgent, build_embedder, tool_agent, extract_code, code_judge
import json, time

# known_input, expected_output, case1, ans1, case2, ans2 are placeholders: supply your own values.

def my_probe(ns: dict) -> list[str]:
    """Your deterministic edge-case test for the current task."""
    fn = ns.get("my_function")
    if not fn:
        return ["my_function not defined"]
    result = fn(known_input)
    if result != expected_output:
        return [f"Got {result}, expected {expected_output}. FIX: ..."]
    return []

def my_check(ns: dict, stdout: str) -> bool:
    """Your full pass/fail verifier (the same function you'd pass to code_judge)."""
    fn = ns.get("my_function")
    return bool(fn) and fn(case1) == ans1 and fn(case2) == ans2

TASKS = [
    {"name": "Step 1", "prompt": "Write my_function that ...", "probe": my_probe},
    {"name": "Step 2", "prompt": "Now extend it to handle ..."},
    # tasks in dependency order; each builds on the previous
]

brain = BrainAgent(build_embedder(), threshold=0.30, k=5)
agent = tool_agent(["python_exec"], max_turns=10, model="claude-haiku-4-5-20251001")
agent.monitor = brain
verifier = code_judge(my_check)  # evaluate with code_judge or your own check function

results = []
for i, task in enumerate(TASKS):
    brain.set_task(i, probe_fn=task.get("probe"))
    brain.reset()
    t0 = time.time()
    trace, tokens = agent(task["prompt"])
    fires = brain._code_interventions
    code = extract_code(trace)
    first_pass = verifier(task["prompt"], trace) >= 0.5
    print(f"{i+1}/{len(TASKS)} {'✓' if first_pass else '✗'} "
          f"{task['name']} {tokens:,} tok {time.time()-t0:.0f}s"
          + (f" [⚡×{fires}]" if fires else ""))
    # store first-attempt trace with first-attempt label
    brain.store(trace, int(first_pass))
    if code:
        brain.store_code(code, int(first_pass))
    results.append({"name": task["name"], "passed": first_pass, "fires": fires})

# Use a context manager so the session file is flushed and closed.
with open("my_session.json", "w") as f:
    json.dump(results, f, indent=2)
Live dashboard (eval/viz_brain.py)
BrainViz renders a 4-panel dark dashboard that updates after every task:
| Panel | What it shows |
|---|---|
| Neuron Graph | k-means thought-state nodes sized by visit count, colored green→red by failure rate; transition edges weighted by frequency. Raw scatter shown before Markov activates (≥30 chunks). |
| Trajectory Map | PCA-2D of all stored chunk embeddings. Each completed run is a polyline: green=pass, red=fail. Clusters of failure trajectories become visible as the store fills. |
| Score Timeline | Per-task pass/fail bars with cumulative accuracy line. |
| Fire Report | Brain fires per task + cumulative fire rate vs 30% dashed reference. |
Embedder
build_embedder() in agents.py prefers local and falls back to remote:
- sentence-transformers (preferred): all-MiniLM-L6-v2, 384-dim, free, ~10ms/chunk on CPU, no API key required
- OpenAI text-embedding-3-small: 1536-dim, requires OPENAI_API_KEY
Both return L2-normalised float32 vectors and are drop-in interchangeable.
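L2 normalisation matters because, for unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below shows that property with toy two-dimensional vectors. `l2_normalise` and `dot` are illustrative helpers, not the embedder's code. Note that vectors from the two backends have different dimensions (384 vs 1536), so a store built with one backend presumably cannot be queried with the other.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalise([3.0, 4.0])
b = l2_normalise([4.0, 3.0])
print(a)          # unit-length version of [3, 4]
print(dot(a, a))  # self-similarity of a unit vector
print(dot(a, b))  # cosine similarity between the two directions
```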
Verifiers (pipeline.py)
The only task-specific input to the pipeline is a Verifier: (question, answer) -> float in [0, 1].
| Verifier | When to use |
|---|---|
| code_judge(check_fn) | Programmatic: exec the code and run your assertions |
| gold_judge(gold, agent) | Ground-truth string available |
| self_judge(judge_agent) | No ground truth: use a different model to grade |
| tiered_judge(fast, strong) | Save cost: fast model on easy cases, strong on uncertain |
| self_consistency(resample, samples) | No judge: re-run and check agreement |
# code_judge: cleanest signal, use when possible
def check(ns: dict, stdout: str) -> bool:
    fn = ns.get("min_variance_portfolio")
    if not fn:
        return False
    import numpy as np
    cov = np.diag([0.04, 0.16])
    r = fn(cov)
    w = np.array(r["weights"]).flatten()
    return abs(sum(w) - 1.0) < 0.01 and w[0] > 0.5  # more weight on lower-var asset

verifier = code_judge(check)
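Any callable with the (question, answer) -> float shape can act as a verifier. As an illustration of the agreement idea behind self-consistency, the sketch below scores an answer by the share of fresh samples that match it. This is a stand-in written for this example, not the library's self_consistency implementation. `make_agreement_verifier` and the fake model are hypothetical.

```python
def make_agreement_verifier(resample, samples=3):
    """Build a (question, answer) -> float verifier from a resampling callable.

    resample(question) should return a fresh answer string each time it is called.
    """
    def verifier(question, answer):
        target = answer.strip().lower()
        others = [resample(question).strip().lower() for _ in range(samples)]
        return sum(1 for o in others if o == target) / samples
    return verifier


# A fake model that answers "42" twice and "41" once.
fake_answers = iter(["42", "42", "41"])
verifier = make_agreement_verifier(lambda q: next(fake_answers), samples=3)
print(verifier("What is 6 * 7?", "42"))  # two of three resamples agree
```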
run_task reference
run_task(
    task = "...",              # task string
    agent = haiku,             # callable: prompt -> text or (text, tokens)
    verifier = verifier,       # callable: (q, trace) -> float
    forecaster = fc,           # Forecaster instance (optional)
    retriever = retriever,     # context retriever (optional)
    threshold = None,          # override adaptive threshold (optional)
    cap = 8,                   # max sub-questions from decompose
    display = True,            # Rich live terminal output
    retry = True,              # fire self-critique retry on high P(fail)
    retry_agent = None,        # different agent for retries
    decompose_agent = None,    # different agent for decomposition
)
Returns a TaskResult with .n_pass, .n_fail, .n_intervened, .summary(), and per-component .components (each with .question, .trace, .p_fail, .label, .retried, .neighbor).
Demos
# classic AUC demos
python demo_general.py # 40 diverse tasks, CoT haiku, AUC ~0.68
python demo_debug.py # 29 Python debugging tasks, tool agent, AUC ~0.87
python demo_large.py # 80+ mixed tasks, full Rich display
# brain interception evals
python eval/eval_fires.py # 30 diverse domains, haiku, live brain dashboard
python eval/eval_hard.py # 14 hard one-shot failures, Sonnet
python eval/eval_hard.py --haiku # same tasks with Haiku
python eval/eval_haiku_intensive.py # 30 tasks, haiku, intensive
python eval/eval_real_world.py # 30 hard (segment tree, GPQA-style), haiku
python eval/eval_extensive.py # 32 tasks, LiveCodeBench Pro / ICPC-Eval difficulty
python eval/eval_project.py # 15-task portfolio analyzer session, haiku
Repo layout
| Path | Role |
|---|---|
| pipeline.py | Public API: run_task, decompose, attempt, Forecaster, make_retriever, all verifiers |
| brain.py | BrainAgent, TrajectoryStore, FailureStore: inference-time failure interception |
| forecast.py | Primitives: knn_predict, knn_predict_cross, auc, spearman |
| display.py | Rich live terminal display used by run_task |
| agents.py | haiku, opus, tool_agent, streaming_agent, build_embedder (lazy clients, keys from env/.env) |
| demo_general.py | 40 diverse tasks, CoT haiku, live plot, AUC ~0.68 |
| demo_debug.py | 29 Python debugging tasks, tool agent, AUC ~0.87 |
| demo_large.py | 80+ mixed tasks, full Rich display |
| bench/ | Vendored benchmark loaders (FanOutQA, MuSiQue) |
| eval/eval_fires.py | 30-task brain eval, diverse domains |
| eval/eval_hard.py | 14 hard one-shot failures, Sonnet + Haiku |
| eval/eval_haiku_intensive.py | 30-task intensive haiku session |
| eval/eval_real_world.py | 30 hard tasks: competitive programming + GPQA-style science |
| eval/eval_extensive.py | 32 tasks: LiveCodeBench Pro / ICPC-Eval difficulty + GPQA-style |
| eval/eval_project.py | 15-task portfolio risk analyzer (the day-in-the-life benchmark) |
| eval/viz_brain.py | 4-panel live brain dashboard |
| eval/results/ | All saved charts and JSON run logs |
| tests/ | Offline test suite: test_forecast.py, test_pipeline.py (~150 tests, ~2s, fully stubbed) |
Limitations
- Probe tests need a known failure mode. They catch localized bugs — a wrong formula, a missed edge case, a boundary condition. When the whole algorithm approach is wrong, probe feedback alone can't recover it (seen in the extensive benchmark: segment tree, Kruskal's MST).
- kNN signal needs warm-up. The trajectory and code-snippet stores need ~50 traces with mixed outcomes before predictions are reliable. Probe tests work immediately; kNN fires later as failures accumulate.
- Trace richness is required. One-liner responses produce near-identical embeddings regardless of correctness. Use a tool-calling agent or wrap any text model in a CoT prompt that forces step-by-step output.
- Verifier quality sets the ceiling. Mislabeled traces corrupt the kNN store. Prefer programmatic checks; when using an LLM judge, always use a different model than the one being evaluated.
- Brain is most impactful in the 15–40% failure band. Above ~90% pass rate, fires are rare and marginal gains are small. Below ~60%, the store fills quickly with failures but the model may need a fundamentally different approach rather than mid-turn correction.
File details
Details for the file trace_use-0.1.0.tar.gz.
File metadata
- Download URL: trace_use-0.1.0.tar.gz
- Upload date:
- Size: 73.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 35fbd0669787a09d0953b5aa7dc3345918f2a9bd2e70857e1bc469e305666b01 |
| MD5 | a5f095be2a21faf7cf587753324a6d67 |
| BLAKE2b-256 | de196a7bda918743f27058516621fef911e4910f28c5617634626bc9bdb1fd12 |
File details
Details for the file trace_use-0.1.0-py3-none-any.whl.
File metadata
- Download URL: trace_use-0.1.0-py3-none-any.whl
- Upload date:
- Size: 45.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 81a86ba0eaa7837b721d5d5ed67ed901711bb2ca7cb34bbf8d7db2bbfc3ee86c |
| MD5 | 6f41003b12f9f9595b75177f8b132ad6 |
| BLAKE2b-256 | a06bad687eb8759d9aa4c947b647d7611a91f5267287a783369bec0dca19cc4e |