Forecast agent failure from execution traces — spend retries and verification only where needed.

trace_use

trace_use is a self-contained Python toolkit that monitors LLM agents in real time, intercepts bugs mid-turn with deterministic probe tests, and learns from accumulated failures to predict where the next one will land.

It wraps around any tool-use agent — a single line of code — and provides two complementary layers:

| Layer | When it runs | What it does |
| --- | --- | --- |
| brain.py: Online Brain | During execution, after every tool call | Runs deterministic probes on live code, injects targeted corrective feedback mid-turn |
| pipeline.py: Forecaster | After task completion | Embeds traces, stores with pass/fail labels, predicts P(fail) via kNN for retry decisions |

The key insight: the trace carries the failure signal

How an agent reasons predicts failure independently of whether the final answer looks wrong. Reasoning-only AUC on structured multi-hop tasks reaches 0.84 — wrong reasoning paths diverge from correct ones in embedding space well before the final answer token.

This means failure can be detected mid-generation, not just retrospectively. The signal transfers to task types never seen before (leave-one-task-type-out AUC 0.61–0.73). One-liner responses have near-zero signal; multi-step reasoning (tool traces, chain-of-thought) is what makes it work.

| Agent type | Signal quality | Why |
| --- | --- | --- |
| Tool-use agent (python_exec, search) | High (AUC 0.87) | Tool call sequences differ structurally; correct traces show clean execution, failing traces show wrong output or repeated attempts |
| Text agent with CoT | Moderate (AUC 0.68) | Wrong reasoning produces wrong intermediate values; a forced step-by-step output creates discriminating structure |
| One-liner text agent | Near chance | "Paris" and "Lyon" produce near-identical embeddings |
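
For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of a pairwise (rank-based) AUC: the probability that a randomly chosen failing trace receives a higher P(fail) than a randomly chosen passing one. The name `pairwise_auc` is hypothetical and is not claimed to match the `auc` primitive in forecast.py.

```python
def pairwise_auc(scores: list[float], failed: list[bool]) -> float:
    """Share of (failing, passing) pairs where the failing item scores higher.

    Ties count as half a win, so identical scores everywhere give 0.5 (chance).
    """
    pos = [s for s, f in zip(scores, failed) if f]        # scores of failing traces
    neg = [s for s, f in zip(scores, failed) if not f]    # scores of passing traces
    if not pos or not neg:
        raise ValueError("need at least one failing and one passing trace")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four (fail, pass) pairs are ranked correctly.
example_auc = pairwise_auc([0.9, 0.8, 0.3, 0.2], [True, False, True, False])

# One-liner style traces: identical scores carry no ranking information.
chance_auc = pairwise_auc([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
```

An AUC near 0.5 therefore means the scores cannot separate failures from passes, which is the "near chance" row in the table.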

Practical rule: force multi-step output. A CoT wrapper adds signal to any text-only agent:

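from trace_use import haiku

def cot_agent(prompt: str):
    """Wrap haiku so every response contains explicit step-by-step reasoning."""
    return haiku(
        prompt + "\n\nThink step by step, showing every intermediate result. "
        "End with 'ANSWER: ...'."
    )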
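
Because the wrapper asks for a trailing 'ANSWER: ...' line, the reasoning can be separated from the final answer before embedding. A minimal sketch follows; `split_cot` is a hypothetical helper, not part of trace_use.

```python
def split_cot(response: str) -> tuple[str, str]:
    """Split a CoT response into (reasoning, answer) at the last 'ANSWER:' marker."""
    marker = "ANSWER:"
    idx = response.rfind(marker)
    if idx == -1:
        # No marker found: treat everything as reasoning and leave the answer empty.
        return response.strip(), ""
    return response[:idx].strip(), response[idx + len(marker):].strip()


reasoning, answer = split_cot("6 * 7 = 42\nANSWER: 42")
```

Embedding only `reasoning` is one way to reproduce a reasoning-only signal that ignores the answer text.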

The Brain (brain.py)

BrainAgent attaches to a tool-use agent via a single hook called after every tool execution. It provides three independent failure signals:

Signal 1 — Probe tests (deterministic, fires immediately)

For each task, you register a probe_fn(ns: dict) -> list[str]. After the agent's first python_exec call, the brain re-runs the code in an isolated namespace and calls the probe. If the probe returns failures, it immediately injects specific corrective feedback into the tool result before the next LLM turn:

STOP — your code fails these tests RIGHT NOW:
  ✗ rolling_vol([0.01, 0.01, ..., 0.01], window=10) = 0.0043
    Expected ≈ 0 for a constant-return series.
  FIX: Use returns.rolling(window).std() — not rolling().mean().std().
       The latter computes vol-of-averages, not rolling volatility.
Fix the specific issue above then call python_exec again immediately.

The agent reads this as the tool result and corrects the bug in the next turn. No re-prompting, no retry from scratch.
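
The probe mechanism can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in brain.py: `run_probe` and `probe_double` are hypothetical names, and a plain `exec` into a fresh dict keeps namespaces separate but is not a security sandbox.

```python
from typing import Callable, Optional


def run_probe(code: str, probe_fn: Callable[[dict], list]) -> Optional[str]:
    """Execute code in a fresh namespace, run the probe, return feedback or None."""
    ns: dict = {}
    try:
        exec(code, ns)  # fresh dict, so nothing leaks in from earlier runs
    except Exception as exc:
        return f"STOP: your code raised {type(exc).__name__}: {exc}"
    failures = probe_fn(ns)
    if not failures:
        return None  # correct code: leave the tool result untouched
    lines = ["STOP: your code fails these tests RIGHT NOW:"]
    lines += [f"  ✗ {msg}" for msg in failures]
    lines.append("Fix the specific issue above then call python_exec again immediately.")
    return "\n".join(lines)


def probe_double(ns: dict) -> list:
    """Toy probe: double(3) must equal 6."""
    fn = ns.get("double")
    if fn is None:
        return ["double(x) not defined"]
    got = fn(3)
    return [] if got == 6 else [f"double(3) = {got}, expected 6. FIX: return x * 2."]


buggy = "def double(x):\n    return x + 2\n"
fixed = "def double(x):\n    return x * 2\n"
feedback = run_probe(buggy, probe_double)   # feedback text for the buggy version
no_feedback = run_probe(fixed, probe_double)  # None for the fixed version
```

Returning `None` for passing code mirrors the hook contract described later, where the hook returns either a modified result or None.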

Signal 2 — kNN over stored code snippets (learned)

A FailureStore keeps code-snippet embeddings with pass/fail labels. Every python_exec input is embedded and compared against stored snippets at query time. When P(fail) exceeds threshold, similar failed snippets surface as context — the agent can see which patterns caused failures before.
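
The kNN estimate itself is simple to state. The sketch below assumes L2-normalised embeddings (so a dot product is cosine similarity) and the label convention used elsewhere in this README (1 = pass, 0 = fail); `knn_p_fail` is a hypothetical name, not the FailureStore API.

```python
import numpy as np


def knn_p_fail(query: np.ndarray, stored: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of failing items among the k stored vectors most similar to query."""
    if len(stored) == 0:
        return 0.0  # assumption: an empty store predicts no failure
    sims = stored @ query              # cosine similarity for unit-length rows
    top = np.argsort(-sims)[:k]        # indices of the k nearest neighbours
    return float(np.mean(labels[top] == 0))


# Toy 2-D store: two failing snippets near [1, 0], two passing ones near [0, 1].
stored = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
labels = np.array([0, 0, 1, 1])

p_near_failures = knn_p_fail(np.array([1.0, 0.0]), stored, labels, k=2)
p_near_passes = knn_p_fail(np.array([0.0, 1.0]), stored, labels, k=2)
```

A query next to the failing snippets scores high and one next to the passing snippets scores low; comparing that score to the threshold decides whether to surface the neighbours.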

Signal 3 — Trajectory prefix kNN + Markov chain (learned)

A TrajectoryStore stores completed runs as ordered sequences of chunk embeddings:

  • Prefix kNN: The mean embedding of the live trajectory's prefix is compared to the same-length prefix of each stored run. kNN fraction of failing runs → P(fail). Works from the first stored example.
  • Markov state failure rate: Once ≥ 30 chunks are stored, k-means discretizes all chunk embeddings into thought-state clusters. Each cluster tracks what fraction of runs visiting it eventually failed. Current chunk → nearest cluster → P(fail | state). Captures: "models that reason this way tend to get the wrong answer."

The two learned signals combine: p_fail = 0.55 × p_markov + 0.45 × p_prefix.
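
The two learned signals and their weighted combination can be sketched as below. All function names are hypothetical, the centroids are assumed to come from an already fitted k-means, and the per-state failure rates are made-up toy values; only the 0.55 / 0.45 weights are taken from the formula above.

```python
import numpy as np


def prefix_mean(chunks, n: int) -> np.ndarray:
    """L2-normalised mean of the first n chunk embeddings of a run."""
    m = np.asarray(chunks, dtype=float)[:n].mean(axis=0)
    return m / (np.linalg.norm(m) + 1e-12)


def p_prefix(live, runs, labels, k: int = 5) -> float:
    """kNN failure fraction over same-length prefixes (labels: 1 = pass, 0 = fail)."""
    n = len(live)
    q = prefix_mean(live, n)
    sims = np.array([prefix_mean(run, n) @ q for run in runs])
    top = np.argsort(-sims)[:k]
    return float(np.mean([labels[i] == 0 for i in top]))


def p_markov(chunk, centroids, state_fail_rate) -> float:
    """Failure rate of the thought-state cluster nearest to the current chunk."""
    diffs = np.asarray(centroids, dtype=float) - np.asarray(chunk, dtype=float)
    state = int(np.argmin(np.linalg.norm(diffs, axis=1)))
    return float(state_fail_rate[state])


def combine(p_markov_value: float, p_prefix_value: float) -> float:
    return 0.55 * p_markov_value + 0.45 * p_prefix_value


# Toy data: one failing run near [1, 0], one passing run near [0, 1].
runs = [np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([[0.0, 1.0], [0.0, 1.0]])]
labels = [0, 1]
live = np.array([[0.9, 0.1]])                    # live trajectory, one chunk so far
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])   # assumed pre-fitted cluster centres
state_fail_rate = [0.8, 0.1]                     # toy P(fail | state) values

pp = p_prefix(live, runs, labels, k=1)
pm = p_markov(live[-1], centroids, state_fail_rate)
p_fail = combine(pm, pp)
```

In this toy case the live chunk sits next to the failing run and the high-failure cluster, so both signals agree and the combined estimate is high.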

Wiring it up

from trace_use import BrainAgent, build_embedder, tool_agent, extract_code

embedder = build_embedder()                          # local sentence-transformers, free
brain    = BrainAgent(embedder, threshold=0.30, k=5)

agent = tool_agent(["python_exec"], max_turns=8, model="claude-haiku-4-5-20251001")
agent.monitor = brain                                # single line to attach the brain

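def probe_fn(ns: dict) -> list[str]:
    """Return empty list if code is correct; return error strings to trigger intervention."""
    import numpy as np, pandas as pd

    fn = ns.get("compute_returns")
    if not fn:
        return ["compute_returns(prices_df) not defined"]
    prices = pd.DataFrame({"A": [100.0, 110.0]})
    r = fn(prices)
    if r is None:
        return ["compute_returns returned None; it must return a DataFrame of log returns."]
    r = r.dropna()   # a leading NaN row would otherwise slip past the numeric check
    if r.empty:
        return ["compute_returns returned no non-NaN rows for a two-row price frame."]
    got = float(r.iloc[0, 0])
    # Tolerance must be tighter than the log-vs-arithmetic gap (0.1000 - 0.0953 ≈ 0.0047).
    if abs(got - np.log(1.1)) > 1e-3:
        return [
            f"compute_returns gives {got:.4f} for 100→110, "
            f"expected log return {np.log(1.1):.4f}. "
            "FIX: Use np.log(prices / prices.shift(1)).dropna() not arithmetic returns."
        ]
    return []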

for i, task in enumerate(tasks):
    brain.set_task(i, probe_fn=probe_fn)   # register probe (or None for kNN-only)
    brain.reset()
    trace, tokens = agent(task["prompt"])

    code   = extract_code(trace)       # from trace_use import extract_code
    passed = run_checks(code)          # your own pass/fail function

    # IMPORTANT: always store the FIRST-attempt trace with the FIRST-attempt label.
    # Never store retry traces — they conflate recovery patterns with failure patterns.
    brain.store(trace, int(passed))
    if code:
        brain.store_code(code, int(passed))

The brain fires at most 2 times per task to avoid flooding the agent with warnings. Probe tests fire on first-attempt bugs; kNN fires later in the run once enough failures accumulate.

BrainAgent public API

| Method | Description |
| --- | --- |
| brain.set_task(idx, probe_fn=fn) | Register the current task index and optional deterministic probe |
| brain.reset() | Clear buffer and intervention counter before a new task |
| brain.on_tool_call(name, input_dict, result) | Hook called by tool_agent on every tool execution; returns modified result or None |
| brain.store(trace, label, metadata="") | Store a completed run's full trajectory (label: 1=pass, 0=fail) |
| brain.store_code(code, label, metadata="") | Store a code snippet with its label |
| brain.n_stored | Total completed runs in the trajectory store |
| brain._code_interventions | How many times the brain fired on the current task |

Storage invariant

Always store the first-attempt trace with the first-attempt label — even when a retry fires and recovers a failed component. Storing retry traces conflates recovery patterns with failure patterns and degrades the kNN signal.


The Forecaster (pipeline.py)

Forecaster operates after task completion. It embeds full traces, stores them with labels, and predicts P(fail) for new traces via kNN. It integrates with run_task for end-to-end orchestration.

Quickstart

from trace_use import haiku, opus, build_embedder, run_task, self_judge, Forecaster

embedder   = build_embedder()
forecaster = Forecaster(embedder)
verifier   = self_judge(judge_agent=opus)   # use a different model — self-grading is overconfident

result = run_task(
    task       = "Explain the CAP theorem and name all three properties.",
    agent      = haiku,
    verifier   = verifier,
    forecaster = forecaster,
    retry      = True,
)
print(result.summary())

With a tool-use agent

from trace_use import tool_agent, build_embedder, run_task, code_judge, Forecaster

agent = tool_agent(["python_exec"], max_turns=6)
fc    = Forecaster(build_embedder())

def check(namespace: dict, stdout: str) -> bool:
    fn = namespace.get("binary_search")
    return fn is not None and fn([1,3,5,7,9], 5) == 2 and fn([1,3,5,7,9], 9) == 4

result = run_task(
    task       = "Fix the off-by-one in this binary search: ...",
    agent      = agent,
    verifier   = code_judge(check),
    forecaster = fc,
    retry      = True,
)

Forecaster API

| Method / property | Description |
| --- | --- |
| fc.fit(traces, labels) | Bulk-load trace strings and int labels |
| fc.add(trace, label) | Add one trace online after a task completes |
| fc.predict_fail(trace) | float in [0,1]: P(this trace fails) |
| fc.should_intervene(trace) | bool, uses the adaptive threshold |
| fc.explain(trace, k=3) | Nearest stored traces with similarity, label, and excerpt |
| fc.adaptive_threshold | Auto-computed: fail_rate + (1 − fail_rate) × 0.20 |

Cold-start: predictions become reliable at approximately 50 traces with a mix of passes and failures. Before that, predict_fail returns 0.0.
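
As a worked example of the adaptive threshold formula, the sketch below recomputes it from a list of labels (1 = pass, 0 = fail). The function name is hypothetical and the handling of an empty store is an assumption; only the formula itself comes from the table above.

```python
def adaptive_threshold_from(labels: list[int]) -> float:
    """fail_rate + (1 - fail_rate) * 0.20, computed from stored labels."""
    if not labels:
        raise ValueError("no stored traces yet")  # assumption: undefined when empty
    fail_rate = labels.count(0) / len(labels)
    return fail_rate + (1 - fail_rate) * 0.20


# 50 stored traces, 10 of them failures: fail_rate = 0.2, threshold = 0.2 + 0.8 * 0.2.
typical = adaptive_threshold_from([0] * 10 + [1] * 40)

# Extremes: an all-pass store gives the 0.20 floor, an all-fail store gives 1.0.
floor = adaptive_threshold_from([1] * 20)
ceiling = adaptive_threshold_from([0] * 20)
```

The threshold always sits above the base failure rate, so a trace has to look riskier than average before an intervention fires.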


Results

Summary across all evaluations

| Eval | Model | Tasks | Baseline | +Brain | Brain contribution |
| --- | --- | --- | --- | --- | --- |
| Multi-hop QA (FanOutQA + MuSiQue) | Haiku | | component AUC 0.85 | | |
| Python debugging (demo_debug.py) | Haiku | 29 | AUC 0.87 | | |
| Diverse everyday tasks (demo_general.py) | Haiku | 40 | AUC 0.68 | | |
| 30 diverse domains (eval_fires) | Haiku | 30 | 27/30 (90%) | 28/30 (93%) | +1 task, 5 fires |
| Hard code + text (eval_hard) | Sonnet | 14 | 12/14 (86%) | 13/14 (93%) | +1 task, 1 fire |
| 30-task intensive (eval_haiku_intensive) | Haiku | 30 | 26/30 (87%) | 27/30 (90%) | +1 task, 2 fires |
| Real-world hard tasks (eval_real_world) | Haiku | 30 | 28/30 (93%) | 29/30 (97%) | +1 task, 2 fires |
| Extensive benchmark (eval_extensive) | Haiku | 32 | 28/32 (88%) | 28/32 (88%) | 0 tasks, 5 fires |
| Portfolio Risk Analyzer (eval_project) | Haiku | 15 | 13/15 (87%) | 14/15 (93%) | +1 task, 4 fires |

Multi-hop QA — per-component forecasting

Decomposing tasks into atomic sub-questions and forecasting each independently raised AUC from ~0.45 (chance, whole-task labels) to 0.85 on structured multi-hop QA (FanOutQA + MuSiQue).

| Metric | Value |
| --- | --- |
| Per-component failure AUC | 0.85 |
| Reasoning-only AUC (no answer text) | 0.84 |
| Failures caught at 20% verify budget | 31% (1.56× random baseline) |
| Budget to catch 80% of failures | 58–68% of components |
| Leave-one-task-type-out AUC | 0.61–0.73 (zero-shot transfer) |
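
The "failures caught at a verify budget" metric can be computed as sketched below: rank components by predicted P(fail), verify only the top share, and count how many of all failures fall inside it. The function name and the toy numbers are illustrative assumptions, not the evaluation code or data behind the table.

```python
import numpy as np


def failures_caught_at_budget(p_fail, failed, budget: float) -> float:
    """Fraction of all failures found when verifying the top `budget` share by P(fail)."""
    p_fail = np.asarray(p_fail, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    n_verify = int(round(budget * len(p_fail)))
    order = np.argsort(-p_fail, kind="stable")   # highest predicted risk first
    caught = failed[order[:n_verify]].sum()
    return float(caught / max(failed.sum(), 1))


# Toy data: 10 components, 4 of which actually failed.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]

caught = failures_caught_at_budget(scores, actual, budget=0.2)
lift = caught / 0.2   # random verification catches `budget` of failures in expectation
```

The lift over random is the caught fraction divided by the budget, which is how a figure such as 31% at a 20% budget translates into a multiple of the random baseline.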

Hard one-shot failures — Sonnet + Brain (eval/eval_hard.py)

14 tasks where Sonnet reliably fails in one shot: 7 hard algorithm tasks (LRU cache, sliding window max, histogram largest rectangle, regex matching, thread-safe bank, burst balloons, Trie) and 7 physics/probability text problems (Bayesian base-rate neglect, rolling sphere inertia, twin paradox, hydrogen emission, buoyancy paradox, Simpson's paradox, Bertrand box).

| | Baseline | +Brain |
| --- | --- | --- |
| Code tasks (7) | 6/7 | 7/7 |
| Text tasks (7) | 6/7 | 6/7 |
| Overall | 12/14 (86%) | 13/14 (93%) |

Brain fixed the histogram (largest rectangle) task — Sonnet's first implementation used a naive O(n²) approach that produced wrong results on edge cases. The probe caught it in one fire.

Figure: Hard tasks eval (Sonnet + Brain)


Real-world hard tasks — 30 tasks (eval/eval_real_world.py)

Tasks drawn from confirmed LLM failure modes in competitive programming and GPQA Diamond research: segment tree with lazy propagation, KMP with overlapping matches, LIS O(n log n), Bellman-Ford with negative cycle detection, Graham scan convex hull, matrix chain multiplication, sliding window median, Manacher's palindrome — and 15 graduate-level science and combinatorics problems (Nernst equation, Compton scattering, de Broglie wavelength, Henderson-Hasselbalch, Michaelis-Menten, CRT, Stirling numbers, derangements).

| | Baseline | +Brain |
| --- | --- | --- |
| Code (15 tasks) | 13/15 (87%) | 14/15 (93%) |
| Text (15 tasks) | 15/15 (100%) | 15/15 (100%) |
| Overall | 28/30 (93%) | 29/30 (97%) |

The brain fixed the Graham scan convex hull — haiku's first implementation failed the probe's edge-case tests (collinear point handling and interior point exclusion). The probe fired twice; haiku corrected both issues in subsequent turns.

Haiku solved 15/15 GPQA-style text problems correctly on first attempt — Henderson-Hasselbalch, Bragg's law, Nernst equation, Compton scattering, CRT, derangements, Catalan and Stirling numbers all passed without brain intervention.

Figure: Real-world hard tasks (Haiku + Brain)


Extensive hard-task benchmark — 32 tasks (eval/eval_extensive.py)

32 tasks drawn from competitive programming (LiveCodeBench Pro / ICPC-Eval difficulty) and GPQA-style science: lazy-propagation segment tree, bitmask TSP, matrix exponentiation, digit DP, Manacher's, minimum window substring, lexicographic topological sort, Kruskal's MST, plus Python debugging traps and 12 physics/math problems.

| | Baseline | +Brain |
| --- | --- | --- |
| Code (20 tasks) | 17/20 (85%) | 17/20 (85%) |
| Text (12 tasks) | 11/12 (92%) | 11/12 (92%) |
| Overall | 28/32 (88%) | 28/32 (88%) |

Brain fired on 5 tasks (LCS substring, topological sort, segment tree, Kruskal's MST, late-binding closure); none were fixed. This is the clearest illustration of the brain's ceiling: when a task fails because the entire algorithm is wrong — not because of a specific edge-case bug — probe feedback can't recover it. The brain's value is highest when errors are localized (a formula sign, a boundary condition, a missed edge case), not when the approach itself needs rethinking.

The 4 failures are all genuinely hard for Haiku: lazy-propagation segment tree, Kruskal's MST with Union-Find path compression, Python late-binding closure semantics, and the particle-in-a-box energy formula.

Figure: Extensive hard-task benchmark (Haiku + Brain)


Day-in-the-life project eval — Portfolio Risk Analyzer (eval/eval_project.py)

The most realistic test: 15 sequential tasks that together build a complete stock portfolio risk analyzer from scratch, as a data analyst would in a single working session. Each task builds on the previous — bugs in early tasks propagate downstream.

Tasks (in order):

# Task First attempt +Brain
1 Simulate correlated stock prices (GBM + Cholesky) ✓ ⚡×1
2 Compute log daily returns
3 Rolling 20-day statistics (mean, vol, skew) ✓ ⚡×1 ↑FIXED
4 Annualised covariance matrix
5 Minimum variance portfolio (scipy.optimize)
6 Maximum Sharpe ratio (tangency portfolio)
7 1-day 95% Value at Risk (historical)
8 Conditional VaR / Expected Shortfall
9 Maximum drawdown
10 Annualised Sharpe ratio
11 Portfolio beta to market
12 Risk contribution (marginal to portfolio variance)
13 Stress test: apply shock scenarios ✓ ⚡×2
14 Monthly rebalancing with transaction costs ⚡×2
15 Full portfolio risk report

Overall: 13/15 (87%) baseline → 14/15 (93%) with brain

Figure: Portfolio Risk Analyzer (Haiku + Brain, 15-task session)

What the brain caught (Task 3 — Rolling statistics):

Haiku's first implementation computed returns.rolling(window).mean().std() — the standard deviation of rolling averages — instead of returns.rolling(window).std(), the rolling standard deviation. These are not the same: the first smooths out variation before measuring it, systematically underestimating volatility.
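
The difference between the two expressions is easy to reproduce on a small synthetic series. The sketch below uses an alternating series and a window of 2 (both chosen here purely for illustration) so that every rolling average is identical while each window still has real spread.

```python
import pandas as pd

# Returns alternate between +1.05% and -0.95%, so every 2-day average is 0.05%.
returns = pd.Series([0.0105, -0.0095] * 10)
window = 2

# Rolling volatility: one standard deviation per window (a Series).
rolling_vol = returns.rolling(window).std()

# Vol-of-averages: a single number measuring how much the rolling means move.
vol_of_means = returns.rolling(window).mean().std()

first_vol = float(rolling_vol.dropna().iloc[0])
```

Here each window has a standard deviation of about 0.0141, while the rolling means never move, so the vol-of-averages is effectively zero. Note also that the two expressions do not even return the same type: the first is a Series, the second a scalar.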

The probe detected this with a constant-return test series ([0.01, 0.01, ..., 0.01]), whose rolling volatility must be zero. The implementation reported a nonzero value, so code that looks plausible on typical data still fails here. The brain injected:

STOP — your code fails these tests RIGHT NOW:
  ✗ rolling vol of constant-return series = 0.0043, expected ≈ 0.
  FIX: Use returns.rolling(window).std()
       NOT returns.rolling(window).mean().std()
       The latter gives vol-of-averages, not rolling volatility.

Haiku corrected it in the next turn. Without this catch at Task 3, the covariance matrix (Task 4), Sharpe ratio (Task 10), and the final risk report (Task 15) would all have been built on wrong volatility estimates. Early interception prevents silent error propagation — the core benefit in a project context.



Use it in your own projects

Install

pip install trace-use

Or install from source (for the latest version or to run the evals):

git clone <this-repo>
cd Trace-Optimization
pip install -e .

Set your API key — either export it or drop a .env file at your project root:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...      # only needed if sentence-transformers is unavailable

Then import and go:

from trace_use import BrainAgent, build_embedder, tool_agent
from trace_use import Forecaster, run_task, self_judge, code_judge

Verify the offline test suite at any time (no API key needed):

pytest tests/ -q     # 153 tests, ~1s, fully stubbed

Minimal setup — wrap any task loop in 5 minutes

No probes, no custom verifiers — just the brain's kNN trajectory signal. The brain starts cold and fires warnings as failures accumulate:

from trace_use import BrainAgent, build_embedder, tool_agent

brain         = BrainAgent(build_embedder(), threshold=0.30, k=5)
agent         = tool_agent(["python_exec"], max_turns=8, model="claude-haiku-4-5-20251001")
agent.monitor = brain                 # one line to attach

for i, (prompt, check_fn) in enumerate(my_tasks):
    brain.set_task(i)                 # no probe_fn = kNN-only mode
    brain.reset()

    trace, tokens = agent(prompt)
    passed        = check_fn(trace)   # your existing evaluation logic

    brain.store(trace, int(passed))   # brain learns from this outcome

The brain's kNN signal becomes meaningful around task 15–20 once it has seen a mix of passes and failures. Probe tests (below) work immediately from task 1.


Add probe tests for maximum benefit

Probes are deterministic unit tests that run on the agent's code before the next LLM turn. They are the highest-value part of the brain and work from task 1 with no warm-up.

A good probe:

  • Tests a specific edge case that the model commonly gets wrong
  • Returns an empty list if the code is correct (never fire on correct code)
  • Includes a FIX: clause in the failure message pointing to the exact algorithm error

def probe_sharpe(ns: dict) -> list[str]:
    """Catch the most common Sharpe ratio bug: missing sqrt(252) annualisation."""
    fn = ns.get("sharpe_ratio") or ns.get("annualized_sharpe")
    if not fn:
        return ["sharpe_ratio(returns, rf=0.02) not defined"]

    import pandas as pd
    # Deterministic series with exactly known moments: daily mean=0.0005, std≈0.01
    # → annualised Sharpe ≈ 0.79, daily (non-annualised) Sharpe ≈ 0.05.
    # A random sample of 252 draws can land far from 0.79 and fire on correct code.
    rets = pd.Series([0.0105, -0.0095] * 126)
    sr = fn(rets, rf=0.0)

    if not isinstance(sr, (int, float)):
        return ["sharpe_ratio must return a scalar float"]

    if abs(sr) < 0.3:
        return [
            f"sharpe_ratio = {sr:.4f} — this looks like a daily (non-annualised) figure. "
            "FIX: Multiply by sqrt(periods): "
            "sr = (returns.mean() - rf/periods) / returns.std() * sqrt(periods) "
            "For daily data, sqrt(252) ≈ 15.87."
        ]
    return []

# register per task
brain.set_task(task_idx, probe_fn=probe_sharpe)

Probe design tips:

  • Test the specific algorithm property most likely to be wrong, not the whole function
  • Use a synthetic input where the correct answer is analytically known
  • For numerical functions: compare to a closed-form value, not another implementation
  • Make the FIX: clause algorithmic, not vague — "use np.log(p/p.shift(1))" not "check your formula"

Track what the brain is doing

# See how often the brain fires
print(f"Brain interventions this task: {brain._code_interventions}")
print(f"Total stored trajectories:     {brain.n_stored}")

# After your loop, print a summary
for r in results:
    fires = r.get("fires", 0)
    status = "FIXED" if r.get("brain_helped") else ("FIRE" if fires else "")
    print(f"[{'✓' if r['passed'] else '✗'}] {status:5} {r['name']}")

The visualization dashboard updates live during a run:

from eval.viz_brain import BrainViz
from pathlib import Path

viz = BrainViz()

# inside your task loop, after each task:
viz.update(brain, results, fire_counts)
viz.save(Path("my_session.png"))

This produces a 4-panel dark dashboard showing: the Markov thought-state graph (nodes sized by visit count, colored by failure rate), a PCA trajectory map of all stored runs, a per-task pass/fail timeline, and the brain fire rate.


Threshold tuning

The default threshold is 0.30. Lower it if you want earlier, more aggressive intervention; raise it if the brain fires too often on passing tasks.

brain = BrainAgent(embedder, threshold=0.25)   # more aggressive
brain = BrainAgent(embedder, threshold=0.40)   # more conservative

For a new use case, start at 0.30 and observe brain._code_interventions across 10–20 tasks. If the brain fires on tasks that pass without intervention, raise the threshold. If it never fires on tasks that fail, lower it.
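
The tuning rule above can be turned into a small helper. This is only a heuristic sketch: `threshold_hint` is a hypothetical function, and it assumes per-task records with the `fires` and `passed` keys used in the task loops of this README.

```python
def threshold_hint(records: list[dict]) -> str:
    """Suggest 'raise', 'lower', or 'keep' from per-task fire counts and outcomes."""
    # Fires on tasks that passed: candidates for unnecessary interventions.
    fired_on_pass = sum(1 for r in records if r["fires"] > 0 and r["passed"])
    # Failures the brain never reacted to: candidates for missed interventions.
    silent_on_fail = sum(1 for r in records if r["fires"] == 0 and not r["passed"])
    if fired_on_pass > silent_on_fail:
        return "raise"
    if silent_on_fail > fired_on_pass:
        return "lower"
    return "keep"


too_eager = threshold_hint([
    {"fires": 1, "passed": True},
    {"fires": 2, "passed": True},
    {"fires": 0, "passed": False},
])
too_quiet = threshold_hint([
    {"fires": 0, "passed": False},
    {"fires": 0, "passed": False},
    {"fires": 0, "passed": True},
])
```

One caveat: a task that passed after a fire may have passed because of it, so fires that led to a fix are worth excluding from the first count before acting on the hint.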


Choosing which tasks to add probes for

Not every task needs a probe. Use probes where:

  • There is a specific known failure mode (e.g., a particular edge case, formula direction, or off-by-one)
  • The failure is deterministically testable with a small synthetic input
  • Getting it wrong silently corrupts downstream tasks

Skip probes for:

  • Tasks with ambiguous success criteria
  • Tasks where any reasonable implementation is acceptable
  • Text-generation tasks (probes only work on python_exec)

A realistic day-of-work pattern

This mirrors how the portfolio analyzer eval was run. Each task is a sequential step in a larger project; the brain accumulates signal across all of them:

from trace_use import BrainAgent, build_embedder, tool_agent, extract_code, code_judge
import json, time

def my_probe(ns: dict) -> list[str]:
    """Your deterministic edge-case test for the current task."""
    fn = ns.get("my_function")
    if not fn:
        return ["my_function not defined"]
    result = fn(known_input)
    if result != expected_output:
        return [f"Got {result}, expected {expected_output}. FIX: ..."]
    return []

def my_check(ns: dict, stdout: str) -> bool:
    """Your full pass/fail verifier — same function you'd pass to code_judge."""
    fn = ns.get("my_function")
    return fn is not None and fn(case1) == ans1 and fn(case2) == ans2

TASKS = [
    {"name": "Step 1", "prompt": "Write my_function that ...", "probe": my_probe},
    {"name": "Step 2", "prompt": "Now extend it to handle ..."},
    # tasks in dependency order — each builds on the previous
]

brain         = BrainAgent(build_embedder(), threshold=0.30, k=5)
agent         = tool_agent(["python_exec"], max_turns=10, model="claude-haiku-4-5-20251001")
agent.monitor = brain
results       = []

for i, task in enumerate(TASKS):
    brain.set_task(i, probe_fn=task.get("probe"))
    brain.reset()
    t0 = time.time()

    trace, tokens = agent(task["prompt"])
    fires         = brain._code_interventions
    code          = extract_code(trace)            # from trace_use import extract_code

    # evaluate with code_judge or your own check function
    verifier   = code_judge(my_check)
    first_pass = verifier(task["prompt"], trace) >= 0.5

    print(f"{i+1}/{len(TASKS)} {'✓' if first_pass else '✗'}  "
          f"{task['name']}  {tokens:,} tok  {time.time()-t0:.0f}s"
          + (f"  [⚡×{fires}]" if fires else ""))

    # store first-attempt trace with first-attempt label
    brain.store(trace, int(first_pass))
    if code:
        brain.store_code(code, int(first_pass))

    results.append({"name": task["name"], "passed": first_pass, "fires": fires})

with open("my_session.json", "w") as f:   # context manager closes the file reliably
    json.dump(results, f, indent=2)

Live dashboard (eval/viz_brain.py)

BrainViz renders a 4-panel dark dashboard that updates after every task:

| Panel | What it shows |
| --- | --- |
| Neuron Graph | k-means thought-state nodes sized by visit count, colored green→red by failure rate; transition edges weighted by frequency. Raw scatter shown before Markov activates (≥30 chunks). |
| Trajectory Map | PCA-2D of all stored chunk embeddings. Each completed run is a polyline: green=pass, red=fail. Clusters of failure trajectories become visible as the store fills. |
| Score Timeline | Per-task pass/fail bars with cumulative accuracy line. |
| Fire Report | Brain fires per task + cumulative fire rate vs 30% dashed reference. |

Embedder

build_embedder() in agents.py prefers local and falls back to remote:

  1. sentence-transformers (preferred): all-MiniLM-L6-v2, 384-dim, free, ~10ms/chunk on CPU, no API key required
  2. OpenAI text-embedding-3-small: 1536-dim, requires OPENAI_API_KEY

Both return L2-normalised float32 vectors and are drop-in interchangeable.
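
L2 normalisation matters because it makes a plain dot product equal to cosine similarity, which keeps kNN lookups cheap. A minimal NumPy sketch follows; `l2_normalise` is a hypothetical helper, not the code in agents.py.

```python
import numpy as np


def l2_normalise(vectors) -> np.ndarray:
    """Scale each row to unit length and return float32, guarding against zero rows."""
    v = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.maximum(norms, 1e-12)


unit = l2_normalise([[3.0, 4.0], [1.0, 0.0]])
cosine = float(unit[0] @ unit[1])   # dot product of unit vectors = cosine similarity
```

Since the two backends produce vectors of different dimensionality (384 versus 1536), a store filled with one cannot be queried with the other; switching embedders would presumably mean re-embedding the stored traces.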


Verifiers (pipeline.py)

The only task-specific input to the pipeline is a Verifier: (question, answer) -> float in [0, 1].

| Verifier | When to use |
| --- | --- |
| code_judge(check_fn) | Programmatic: exec the code and run your assertions |
| gold_judge(gold, agent) | Ground-truth string available |
| self_judge(judge_agent) | No ground truth: use a different model to grade |
| tiered_judge(fast, strong) | Save cost: fast model on easy cases, strong on uncertain |
| self_consistency(resample, samples) | No judge: re-run and check agreement |

# code_judge: cleanest signal, use when possible
def check(ns: dict, stdout: str) -> bool:
    fn = ns.get("min_variance_portfolio")
    if not fn: return False
    import numpy as np
    cov = np.diag([0.04, 0.16])
    r = fn(cov)
    w = np.array(r["weights"]).flatten()
    return abs(sum(w) - 1.0) < 0.01 and w[0] > 0.5   # more weight on lower-var asset

verifier = code_judge(check)
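
Since a verifier is described as a plain callable from (question, answer) to a float in [0, 1], a custom one can be written without any library support. The sketch below is a hypothetical exact-match verifier, assuming the agent ends its output with an 'ANSWER: ...' line as in the CoT wrapper; it is not one of the built-in judges.

```python
from typing import Callable


def exact_match_judge(gold: str) -> Callable[[str, str], float]:
    """Build a verifier that scores 1.0 when the final answer matches gold, else 0.0."""
    target = gold.strip().lower()

    def verifier(question: str, answer: str) -> float:
        # Take the text after the last 'ANSWER:' marker (or the whole text if absent).
        final = answer.rsplit("ANSWER:", 1)[-1].strip().lower()
        return 1.0 if final == target else 0.0

    return verifier


judge = exact_match_judge("Paris")
right = judge("Capital of France?", "France's capital is on the Seine.\nANSWER: Paris")
wrong = judge("Capital of France?", "ANSWER: Lyon")
```

Programmatic checks like this avoid the labelling noise of an LLM judge, which matters because every score ends up as a label in the kNN store.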

run_task reference

run_task(
    task            = "...",       # task string
    agent           = haiku,       # callable: prompt -> text or (text, tokens)
    verifier        = verifier,    # callable: (q, trace) -> float
    forecaster      = fc,          # Forecaster instance (optional)
    retriever       = retriever,   # context retriever (optional)
    threshold       = None,        # override adaptive threshold (optional)
    cap             = 8,           # max sub-questions from decompose
    display         = True,        # Rich live terminal output
    retry           = True,        # fire self-critique retry on high P(fail)
    retry_agent     = None,        # different agent for retries
    decompose_agent = None,        # different agent for decomposition
)

Returns a TaskResult with .n_pass, .n_fail, .n_intervened, .summary(), and per-component .components (each with .question, .trace, .p_fail, .label, .retried, .neighbor).


Demos

# classic AUC demos
python demo_general.py          # 40 diverse tasks, CoT haiku, AUC ~0.68
python demo_debug.py            # 29 Python debugging tasks, tool agent, AUC ~0.87
python demo_large.py            # 80+ mixed tasks, full Rich display

# brain interception evals
python eval/eval_fires.py       # 30 diverse domains, haiku, live brain dashboard
python eval/eval_hard.py        # 14 hard one-shot failures, Sonnet
python eval/eval_hard.py --haiku   # same tasks with Haiku
python eval/eval_haiku_intensive.py   # 30 tasks, haiku, intensive
python eval/eval_real_world.py  # 30 hard (segment tree, GPQA-style), haiku
python eval/eval_extensive.py   # 32 tasks, LiveCodeBench Pro / ICPC-Eval difficulty
python eval/eval_project.py     # 15-task portfolio analyzer session, haiku

Repo layout

| Path | Role |
| --- | --- |
| pipeline.py | Public API: run_task, decompose, attempt, Forecaster, make_retriever, all verifiers |
| brain.py | BrainAgent, TrajectoryStore, FailureStore (inference-time failure interception) |
| forecast.py | Primitives: knn_predict, knn_predict_cross, auc, spearman |
| display.py | Rich live terminal display used by run_task |
| agents.py | haiku, opus, tool_agent, streaming_agent, build_embedder (lazy clients, keys from env/.env) |
| demo_general.py | 40 diverse tasks, CoT haiku, live plot, AUC ~0.68 |
| demo_debug.py | 29 Python debugging tasks, tool agent, AUC ~0.87 |
| demo_large.py | 80+ mixed tasks, full Rich display |
| bench/ | Vendored benchmark loaders (FanOutQA, MuSiQue) |
| eval/eval_fires.py | 30-task brain eval, diverse domains |
| eval/eval_hard.py | 14 hard one-shot failures, Sonnet + Haiku |
| eval/eval_haiku_intensive.py | 30-task intensive haiku session |
| eval/eval_real_world.py | 30 hard tasks: competitive programming + GPQA-style science |
| eval/eval_extensive.py | 32 tasks: LiveCodeBench Pro / ICPC-Eval difficulty + GPQA-style |
| eval/eval_project.py | 15-task portfolio risk analyzer (the day-in-the-life benchmark) |
| eval/viz_brain.py | 4-panel live brain dashboard |
| eval/results/ | All saved charts and JSON run logs |
| tests/ | Offline test suite: test_forecast.py, test_pipeline.py (~150 tests, ~2s, fully stubbed) |

Limitations

  • Probe tests need a known failure mode. They catch localized bugs — a wrong formula, a missed edge case, a boundary condition. When the whole algorithm approach is wrong, probe feedback alone can't recover it (seen in the extensive benchmark: segment tree, Kruskal's MST).
  • kNN signal needs warm-up. The trajectory and code-snippet stores need ~50 traces with mixed outcomes before predictions are reliable. Probe tests work immediately; kNN fires later as failures accumulate.
  • Trace richness is required. One-liner responses produce near-identical embeddings regardless of correctness. Use a tool-calling agent or wrap any text model in a CoT prompt that forces step-by-step output.
  • Verifier quality sets the ceiling. Mislabeled traces corrupt the kNN store. Prefer programmatic checks; when using an LLM judge, always use a different model than the one being evaluated.
  • Brain is most impactful in the 15–40% failure band. Above ~90% pass rate, fires are rare and marginal gains are small. Below ~60%, the store fills quickly with failures but the model may need a fundamentally different approach rather than mid-turn correction.
