Forecast agent failure from execution traces — spend retries and verification only where needed.

trace_use

trace_use is a self-contained Python toolkit that monitors LLM agents in real time, intercepts bugs mid-turn with deterministic probe tests, and learns from accumulated failures to predict where the next one will land.

It wraps around any tool-use agent — a single line of code — and provides two complementary layers:

Layer	When it runs	What it does
`brain.py` — Online Brain	During execution, after every tool call	Runs deterministic probes on live code, injects targeted corrective feedback mid-turn
`pipeline.py` — Forecaster	After task completion	Embeds traces, stores with pass/fail labels, predicts P(fail) via kNN for retry decisions

The key insight: the trace carries the failure signal

How an agent reasons predicts failure independently of whether the final answer looks wrong. Reasoning-only AUC on structured multi-hop tasks reaches 0.84 — wrong reasoning paths diverge from correct ones in embedding space well before the final answer token.

This means failure can be detected mid-generation, not just retrospectively. The signal transfers to task types never seen before (leave-one-out AUC 0.61–0.73). One-liner responses have near-zero signal; multi-step reasoning — tool traces, chain-of-thought — is what makes it work.

Agent type	Signal quality	Why
Tool-use agent (`python_exec`, search)	High (AUC 0.87)	Tool call sequences differ structurally; correct traces show clean execution, failing traces show wrong output or repeated attempts
Text agent with CoT	Moderate (AUC 0.68)	Wrong reasoning produces wrong intermediate values; a forced step-by-step output creates discriminating structure
One-liner text agent	Near chance	`"Paris"` and `"Lyon"` produce near-identical embeddings

Practical rule: force multi-step output. A CoT wrapper adds signal to any text-only agent:

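from trace_use import haiku

def cot_agent(prompt: str):
    """Wrap a text-only agent so every response carries step-by-step reasoning."""
    return haiku(
        prompt + "\n\nThink step by step, showing every intermediate result. "
        "End with 'ANSWER: ...'."
    )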
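
Because the wrapper asks for a closing `ANSWER:` line, the reasoning can be separated from the final answer before embedding, which is what a reasoning-only signal needs. A minimal sketch, assuming the trace is plain text; `split_trace` is a hypothetical helper and not part of trace_use:

```python
def split_trace(text: str) -> tuple[str, str]:
    """Split a CoT response into (reasoning, answer) at the last 'ANSWER:' marker."""
    marker = "ANSWER:"
    idx = text.rfind(marker)
    if idx == -1:
        # No marker found: treat the whole response as reasoning.
        return text.strip(), ""
    return text[:idx].strip(), text[idx + len(marker):].strip()


reasoning, answer = split_trace("17 * 3 = 51\n51 + 4 = 55\nANSWER: 55")
```

Embedding `reasoning` alone keeps the answer text out of the vector, so a one-word answer cannot dominate the comparison.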

The Brain (`brain.py`)

BrainAgent attaches to a tool-use agent via a single hook called after every tool execution. It provides three independent failure signals:

Signal 1 — Probe tests (deterministic, fires immediately)

For each task, you register a probe_fn(ns: dict) -> list[str]. After the agent's first python_exec call, the brain re-runs the code in an isolated namespace and calls the probe. If the probe returns failures, it immediately injects specific corrective feedback into the tool result before the next LLM turn:

STOP — your code fails these tests RIGHT NOW:
  ✗ rolling_vol([0.01, 0.01, ..., 0.01], window=10) = 0.0043
    Expected ≈ 0 for a constant-return series.
  FIX: Use returns.rolling(window).std() — not rolling().mean().std().
       The latter computes vol-of-averages, not rolling volatility.
Fix the specific issue above then call python_exec again immediately.

The agent reads this as the tool result and corrects the bug in the next turn. No re-prompting, no retry from scratch.

Signal 2 — kNN over stored code snippets (learned)

A FailureStore keeps code-snippet embeddings with pass/fail labels. Every python_exec input is embedded and compared against stored snippets at query time. When P(fail) exceeds threshold, similar failed snippets surface as context — the agent can see which patterns caused failures before.
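
The idea behind this signal can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the `FailureStore` implementation: the store is a list of `(embedding, label)` pairs, similarity is cosine, P(fail) is the fraction of failing neighbours, and an empty store returns 0.0. The names `cosine` and `knn_p_fail` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_p_fail(query, store, k=5):
    """Fraction of the k most similar stored items that failed (label 1=pass, 0=fail)."""
    if not store:
        return 0.0  # assumption: no evidence means no predicted failure
    ranked = sorted(store, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbours = ranked[:k]
    fails = sum(1 for _, label in neighbours if label == 0)
    return fails / len(neighbours)


# Toy 2-D "embeddings": two failures near the x-axis, passes elsewhere.
store = [
    ([1.0, 0.0], 0),
    ([0.9, 0.1], 0),
    ([0.0, 1.0], 1),
    ([0.1, 0.9], 1),
    ([0.8, 0.3], 1),
]
p = knn_p_fail([1.0, 0.05], store, k=3)  # two of the three nearest items failed
```

With a threshold of 0.30, this query would count as risky, and the two failing neighbours are the snippets that would be surfaced as context.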

Signal 3 — Trajectory prefix kNN + Markov chain (learned)

A TrajectoryStore stores completed runs as ordered sequences of chunk embeddings:

- Prefix kNN: The mean embedding of the live trajectory's prefix is compared to the same-length prefix of each stored run. kNN fraction of failing runs → P(fail). Works from the first stored example.
- Markov state failure rate: Once ≥ 30 chunks are stored, k-means discretizes all chunk embeddings into thought-state clusters. Each cluster tracks what fraction of runs visiting it eventually failed. Current chunk → nearest cluster → P(fail | state). Captures: "models that reason this way tend to get the wrong answer."
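
The per-state failure rate can be illustrated once chunks have already been assigned to clusters. The sketch below skips the k-means step and assumes each run is given as a sequence of integer state ids plus a label; counting each run at most once per state is also an assumption, and `state_failure_rates` is a hypothetical name.

```python
from collections import defaultdict


def state_failure_rates(runs):
    """Map each state to the fraction of runs visiting it that failed.

    runs: list of (state_sequence, label) with label 1=pass, 0=fail.
    """
    visits = defaultdict(int)
    fails = defaultdict(int)
    for states, label in runs:
        for state in set(states):  # a run counts once per state it visits
            visits[state] += 1
            if label == 0:
                fails[state] += 1
    return {state: fails[state] / visits[state] for state in visits}


runs = [
    ([0, 1, 2], 1),
    ([0, 3, 3], 0),
    ([0, 3, 2], 0),
    ([1, 2], 1),
]
rates = state_failure_rates(runs)
```

In this toy data, state 3 is visited only by failing runs, so a live trajectory that lands there would get P(fail | state) = 1.0.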

The two learned signals combine: p_fail = 0.55 × p_markov + 0.45 × p_prefix.
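
As a worked instance of the weighting, the sketch below averages chunk embeddings into a prefix vector and blends the two probabilities. The 0.55/0.45 weights come from the text; falling back to the prefix signal alone before the Markov model is active is an assumption, and both function names are hypothetical.

```python
from typing import Optional, Sequence


def mean_embedding(chunks: Sequence[Sequence[float]]) -> list[float]:
    """Element-wise mean of a non-empty list of equal-length chunk embeddings."""
    n = len(chunks)
    return [sum(column) / n for column in zip(*chunks)]


def combined_p_fail(p_prefix: float, p_markov: Optional[float]) -> float:
    """Blend the two learned signals with the weights stated above."""
    if p_markov is None:
        return p_prefix  # assumption: Markov signal not available yet
    return 0.55 * p_markov + 0.45 * p_prefix


prefix_vec = mean_embedding([[1.0, 0.0], [0.0, 1.0]])
p = combined_p_fail(p_prefix=0.2, p_markov=0.6)  # 0.55*0.6 + 0.45*0.2 = 0.42
```

At the default threshold of 0.30, a combined value of 0.42 would be enough to trigger an intervention even though the prefix signal alone (0.2) would not.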

Wiring it up

from trace_use import BrainAgent, build_embedder, tool_agent, extract_code

embedder = build_embedder()                          # local sentence-transformers, free
brain    = BrainAgent(embedder, threshold=0.30, k=5)

agent = tool_agent(["python_exec"], max_turns=8, model="claude-haiku-4-5-20251001")
agent.monitor = brain                                # single line to attach the brain

def probe_fn(ns: dict) -> list[str]:
    """Return empty list if code is correct; return error strings to trigger intervention."""
    import numpy as np, pandas as pd

    fn = ns.get("compute_returns")
    if not fn:
        return ["compute_returns(prices_df) not defined"]
    prices = pd.DataFrame({"A": [100.0, 110.0]})
    r = fn(prices)
    if r is None or len(r) == 0:
        return ["compute_returns returned nothing for a two-row price frame"]
    got, want = float(r.iloc[0, 0]), float(np.log(1.1))
    # Arithmetic return is 0.1000 and log return is 0.0953, so the tolerance must be
    # below 0.0047 to tell them apart. `not (... <= tol)` also fails on NaN.
    if not abs(got - want) <= 1e-3:
        return [
            f"compute_returns gives {got:.4f} for 100→110, "
            f"expected log return {want:.4f}. "
            "FIX: Use np.log(prices / prices.shift(1)).dropna() not arithmetic returns."
        ]
    return []

for i, task in enumerate(tasks):
    brain.set_task(i, probe_fn=probe_fn)   # register probe (or None for kNN-only)
    brain.reset()
    trace, tokens = agent(task["prompt"])

    code   = extract_code(trace)       # from trace_use import extract_code
    passed = run_checks(code)          # your own pass/fail function

    # IMPORTANT: always store the FIRST-attempt trace with the FIRST-attempt label.
    # Never store retry traces — they conflate recovery patterns with failure patterns.
    brain.store(trace, int(passed))
    if code:
        brain.store_code(code, int(passed))

The brain fires at most 2 times per task to avoid flooding the agent with warnings. Probe tests fire on first-attempt bugs; kNN fires later in the run once enough failures accumulate.

`BrainAgent` public API

Method	Description
`brain.set_task(idx, probe_fn=fn)`	Register the current task index and optional deterministic probe
`brain.reset()`	Clear buffer and intervention counter before a new task
`brain.on_tool_call(name, input_dict, result)`	Hook called by `tool_agent` on every tool execution; returns modified result or `None`
`brain.store(trace, label, metadata="")`	Store a completed run's full trajectory (label: 1=pass, 0=fail)
`brain.store_code(code, label, metadata="")`	Store a code snippet with its label
`brain.n_stored`	Total completed runs in the trajectory store
`brain._code_interventions`	How many times the brain fired on the current task
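
To make the hook contract concrete, here is a toy monitor that follows the same shape: it returns `None` to leave a tool result untouched, returns a modified result when a probe fails, and stops after two fires per task. It is a hypothetical stand-in, not `BrainAgent`; the `"code"` key in `input_dict` and the message wording are assumptions.

```python
from typing import Callable, Optional


class ToyMonitor:
    """Hypothetical stand-in that mimics the hook contract; not BrainAgent itself."""

    def __init__(self, probe_fn: Callable[[dict], list], max_fires: int = 2):
        self.probe_fn = probe_fn
        self.max_fires = max_fires
        self.fires = 0

    def reset(self) -> None:
        """Clear the per-task fire counter."""
        self.fires = 0

    def on_tool_call(self, name: str, input_dict: dict, result: str) -> Optional[str]:
        if name != "python_exec" or self.fires >= self.max_fires:
            return None  # leave the tool result untouched
        ns: dict = {}
        try:
            # Assumed input key; a real system should run this in a sandbox.
            exec(input_dict.get("code", ""), ns)
        except Exception:
            return None  # the tool's own error already reaches the agent
        failures = self.probe_fn(ns)
        if not failures:
            return None
        self.fires += 1
        details = "\n".join(f"  - {msg}" for msg in failures)
        return result + "\n\nProbe failures:\n" + details


def probe(ns: dict) -> list:
    fn = ns.get("double")
    if fn is None:
        return ["double(x) not defined"]
    return [] if fn(2) == 4 else [f"double(2) = {fn(2)}, expected 4"]


monitor = ToyMonitor(probe)
bad_call = {"code": "def double(x):\n    return x + 1"}
outputs = [monitor.on_tool_call("python_exec", bad_call, "ok") for _ in range(3)]
```

The third call returns `None` even though the code is still wrong, which is the fire cap at work.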

Storage invariant

Always store the first-attempt trace with the first-attempt label — even when a retry fires and recovers a failed component. Storing retry traces conflates recovery patterns with failure patterns and degrades the kNN signal.

The Forecaster (`pipeline.py`)

Forecaster operates after task completion. It embeds full traces, stores them with labels, and predicts P(fail) for new traces via kNN. It integrates with run_task for end-to-end orchestration.

Quickstart

from trace_use import haiku, opus, build_embedder, run_task, self_judge, Forecaster

embedder   = build_embedder()
forecaster = Forecaster(embedder)
verifier   = self_judge(judge_agent=opus)   # use a different model — self-grading is overconfident

result = run_task(
    task       = "Explain the CAP theorem and name all three properties.",
    agent      = haiku,
    verifier   = verifier,
    forecaster = forecaster,
    retry      = True,
)
print(result.summary())

With a tool-use agent

from trace_use import tool_agent, build_embedder, run_task, code_judge, Forecaster

agent = tool_agent(["python_exec"], max_turns=6)
fc    = Forecaster(build_embedder())

def check(namespace: dict, stdout: str) -> bool:
    fn = namespace.get("binary_search")
    return bool(fn) and fn([1,3,5,7,9], 5) == 2 and fn([1,3,5,7,9], 9) == 4

result = run_task(
    task       = "Fix the off-by-one in this binary search: ...",
    agent      = agent,
    verifier   = code_judge(check),
    forecaster = fc,
    retry      = True,
)

`Forecaster` API

Method / property	Description
`fc.fit(traces, labels)`	Bulk-load trace strings and int labels
`fc.add(trace, label)`	Add one trace online after a task completes
`fc.predict_fail(trace)`	`float` in `[0,1]` — P(this trace fails)
`fc.should_intervene(trace)`	`bool` — uses adaptive threshold
`fc.explain(trace, k=3)`	Nearest stored traces with similarity, label, and excerpt
`fc.adaptive_threshold`	Auto-computed: `fail_rate + (1 − fail_rate) × 0.20`

Cold-start: predictions become reliable at approximately 50 traces with a mix of passes and failures. Before that, predict_fail returns 0.0.
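
The adaptive threshold formula from the table is easy to check by hand. The sketch below reimplements it from labels; the function name and the handling of an empty label list are assumptions, not the library's code.

```python
def adaptive_threshold(labels: list[int], margin: float = 0.20) -> float:
    """fail_rate + (1 - fail_rate) * margin, with labels 1=pass, 0=fail."""
    if not labels:
        return margin  # assumption: an empty store is treated as fail_rate = 0
    fail_rate = labels.count(0) / len(labels)
    return fail_rate + (1 - fail_rate) * margin


# One failure in four traces: 0.25 + 0.75 * 0.20 = 0.40
t = adaptive_threshold([1, 1, 1, 0])
```

The threshold always sits above the base failure rate by 20% of the remaining headroom, so a trace is flagged only when its predicted P(fail) is clearly worse than average for the store.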

Results

Summary across all evaluations

Eval	Model	Tasks	Baseline	+Brain	Brain contribution
Multi-hop QA (FanOutQA + MuSiQue)	Haiku	component	—	AUC 0.85	—
Python debugging (`demo_debug.py`)	Haiku	29	—	AUC 0.87	—
Diverse everyday tasks (`demo_general.py`)	Haiku	40	—	AUC 0.68	—
30 diverse domains (`eval_fires`)	Haiku	30	27/30 (90%)	28/30 (93%)	+1 task, 5 fires
Hard code + text (`eval_hard`)	Sonnet	14	12/14 (86%)	13/14 (93%)	+1 task, 1 fire
30-task intensive (`eval_haiku_intensive`)	Haiku	30	26/30 (87%)	27/30 (90%)	+1 task, 2 fires
Real-world hard tasks (`eval_real_world`)	Haiku	30	28/30 (93%)	29/30 (97%)	+1 task, 2 fires
Extensive benchmark (`eval_extensive`)	Haiku	32	28/32 (88%)	28/32 (88%)	0 tasks, 5 fires
Portfolio Risk Analyzer (`eval_project`)	Haiku	15	13/15 (87%)	14/15 (93%)	+1 task, 4 fires

Multi-hop QA — per-component forecasting

Decomposing tasks into atomic sub-questions and forecasting each independently raised AUC from ~0.45 (chance, whole-task labels) to 0.85 on structured multi-hop QA (FanOutQA + MuSiQue).

Metric	Value
Per-component failure AUC	0.85
Reasoning-only AUC (no answer text)	0.84
Failures caught at 20% verify budget	31% (1.56× random baseline)
Budget to catch 80% of failures	58–68% of components
Leave-one-task-type-out AUC	0.61–0.73 (zero-shot transfer)
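
The budget metrics in this table follow from ranking components by predicted P(fail) and verifying only the top slice. A minimal sketch on made-up numbers (the scores and labels below are illustrative, not eval data; `caught_at_budget` is a hypothetical name):

```python
def caught_at_budget(p_fail: list[float], labels: list[int], budget: float) -> float:
    """Fraction of all failures found when verifying the top `budget` share by P(fail)."""
    n_verify = round(budget * len(p_fail))
    order = sorted(range(len(p_fail)), key=lambda i: p_fail[i], reverse=True)
    total_fail = labels.count(0)  # labels: 1=pass, 0=fail
    if total_fail == 0:
        return 0.0
    caught = sum(1 for i in order[:n_verify] if labels[i] == 0)
    return caught / total_fail


p_fail = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.3, 0.2, 0.1, 0.05]
labels = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
recall = caught_at_budget(p_fail, labels, budget=0.20)  # 1 of 3 failures in the top 2
lift = recall / 0.20                                     # relative to random checking
```

Random checking at a 20% budget catches about 20% of failures, so the lift is recall divided by the budget. Applied to the table, 0.31 / 0.20 = 1.55, close to the reported 1.56×, which presumably uses unrounded figures.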

Hard one-shot failures — Sonnet + Brain (`eval/eval_hard.py`)

14 tasks chosen as likely one-shot failures for Sonnet: 7 hard algorithm tasks (LRU cache, sliding window max, histogram largest rectangle, regex matching, thread-safe bank, burst balloons, Trie) and 7 physics/probability text problems (Bayesian base-rate neglect, rolling sphere inertia, twin paradox, hydrogen emission, buoyancy paradox, Simpson's paradox, Bertrand box).

	Baseline	+Brain
Code tasks (7)	6/7	7/7
Text tasks (7)	6/7	6/7
Overall	12/14 (86%)	13/14 (93%)

Brain fixed the histogram (largest rectangle) task — Sonnet's first implementation used a naive O(n²) approach that produced wrong results on edge cases. The probe caught it in one fire.

Real-world hard tasks — 30 tasks (`eval/eval_real_world.py`)

Tasks drawn from confirmed LLM failure modes in competitive programming and GPQA Diamond research: segment tree with lazy propagation, KMP with overlapping matches, LIS O(n log n), Bellman-Ford with negative cycle detection, Graham scan convex hull, matrix chain multiplication, sliding window median, Manacher's palindrome — and 15 graduate-level science and combinatorics problems (Nernst equation, Compton scattering, de Broglie wavelength, Henderson-Hasselbalch, Michaelis-Menten, CRT, Stirling numbers, derangements).

	Baseline	+Brain
Code (15 tasks)	13/15 (87%)	14/15 (93%)
Text (15 tasks)	15/15 (100%)	15/15 (100%)
Overall	28/30 (93%)	29/30 (97%)

The brain fixed the Graham scan convex hull — haiku's first implementation failed the probe's edge-case tests (collinear point handling and interior point exclusion). The probe fired twice; haiku corrected both issues in subsequent turns.

Haiku solved 15/15 GPQA-style text problems correctly on first attempt — Henderson-Hasselbalch, Bragg's law, Nernst equation, Compton scattering, CRT, derangements, Catalan and Stirling numbers all passed without brain intervention.

Extensive hard-task benchmark — 32 tasks (`eval/eval_extensive.py`)

32 tasks drawn from competitive programming (LiveCodeBench Pro / ICPC-Eval difficulty) and GPQA-style science: lazy-propagation segment tree, bitmask TSP, matrix exponentiation, digit DP, Manacher's, minimum window substring, lexicographic topological sort, Kruskal's MST, plus Python debugging traps and 12 physics/math problems.

	Baseline	+Brain
Code (20 tasks)	17/20 (85%)	17/20 (85%)
Text (12 tasks)	11/12 (92%)	11/12 (92%)
Overall	28/32 (88%)	28/32 (88%)

Brain fired on 5 tasks (LCS substring, topological sort, segment tree, Kruskal's MST, late-binding closure) and recovered none of the failing ones. This is the clearest illustration of the brain's ceiling: when a task fails because the entire algorithm is wrong, rather than because of a specific edge-case bug, probe feedback can't recover it. The brain's value is highest when errors are localized (a formula sign, a boundary condition, a missed edge case), not when the approach itself needs rethinking.

The 4 failures are all genuinely hard for Haiku: lazy-propagation segment tree, Kruskal's MST with Union-Find path compression, Python late-binding closure semantics, and the particle-in-a-box energy formula.

Day-in-the-life project eval — Portfolio Risk Analyzer (`eval/eval_project.py`)

The most realistic test: 15 sequential tasks that together build a complete stock portfolio risk analyzer from scratch, as a data analyst would in a single working session. Each task builds on the previous — bugs in early tasks propagate downstream.

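Tasks (in order; ⚡×N is the number of brain fires on that task, ↑FIXED marks a first-attempt failure that the brain recovered):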

#	Task	First attempt	+Brain
1	Simulate correlated stock prices (GBM + Cholesky)	✓	✓ ⚡×1
2	Compute log daily returns	✓	✓
3	Rolling 20-day statistics (mean, vol, skew)	✗	✓ ⚡×1 ↑FIXED
4	Annualised covariance matrix	✓	✓
5	Minimum variance portfolio (scipy.optimize)	✓	✓
6	Maximum Sharpe ratio (tangency portfolio)	✓	✓
7	1-day 95% Value at Risk (historical)	✓	✓
8	Conditional VaR / Expected Shortfall	✓	✓
9	Maximum drawdown	✓	✓
10	Annualised Sharpe ratio	✓	✓
11	Portfolio beta to market	✓	✓
12	Risk contribution (marginal to portfolio variance)	✓	✓
13	Stress test: apply shock scenarios	✓	✓ ⚡×2
14	Monthly rebalancing with transaction costs	✗	✗ ⚡×2
15	Full portfolio risk report	✓	✓

Overall: 13/15 (87%) baseline → 14/15 (93%) with brain

What the brain caught (Task 3 — Rolling statistics):

Haiku's first implementation computed returns.rolling(window).mean().std() — the standard deviation of rolling averages — instead of returns.rolling(window).std(), the rolling standard deviation. These are not the same: the first smooths out variation before measuring it, systematically underestimating volatility.

The probe detected this with a constant-return test series ([0.01, 0.01, ..., 0.01]), whose rolling volatility must be zero. The agent's code reported 0.0043 instead, so the probe failed and the brain injected:

STOP — your code fails these tests RIGHT NOW:
  ✗ rolling vol of constant-return series = 0.0043, expected ≈ 0.
  FIX: Use returns.rolling(window).std()
       NOT returns.rolling(window).mean().std()
       The latter gives vol-of-averages, not rolling volatility.

Haiku corrected it in the next turn. Without this catch at Task 3, the covariance matrix (Task 4), Sharpe ratio (Task 10), and the final risk report (Task 15) would all have been built on wrong volatility estimates. Early interception prevents silent error propagation — the core benefit in a project context.
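
The difference between the two expressions can be reproduced without pandas. The sketch below uses the standard library on an alternating series chosen for illustration (it is not the probe's input): `stdev` uses the sample formula (ddof=1), the same default as pandas' `.std()`, and the hypothetical `rolling` helper applies a function to each full window.

```python
from statistics import mean, stdev


def rolling(values, window, fn):
    """Apply fn to every full trailing window, like pandas rolling(window) minus the NaN head."""
    return [fn(values[i - window + 1 : i + 1]) for i in range(window - 1, len(values))]


returns = [0.01, -0.01] * 10  # swings every day, so true volatility is clearly nonzero

# Counterpart of returns.rolling(2).std(): one volatility value per window.
rolling_vol = rolling(returns, 2, stdev)

# Counterpart of returns.rolling(2).mean().std(): a single number, the spread of the averages.
vol_of_averages = stdev(rolling(returns, 2, mean))
```

Every window has a sample standard deviation of about 0.0141, while the window means are all zero, so the vol-of-averages collapses to 0. The buggy expression also returns a scalar where a series is expected, which is a second way downstream code can go wrong.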

Use it in your own projects

Install

pip install trace-use

Or install from source (for the latest or to run evals):

git clone <this-repo>
cd Trace-Optimization
pip install -e .

Set your API key — either export it or drop a .env file at your project root:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...      # only needed if sentence-transformers is unavailable

Then import and go:

from trace_use import BrainAgent, build_embedder, tool_agent
from trace_use import Forecaster, run_task, self_judge, code_judge

Verify the offline test suite at any time (no API key needed):

pytest tests/ -q     # ~150 tests, about 1–2 s, fully stubbed

Minimal setup — wrap any task loop in 5 minutes

No probes, no custom verifiers — just the brain's kNN trajectory signal. The brain starts cold and fires warnings as failures accumulate:

from trace_use import BrainAgent, build_embedder, tool_agent

brain         = BrainAgent(build_embedder(), threshold=0.30, k=5)
agent         = tool_agent(["python_exec"], max_turns=8, model="claude-haiku-4-5-20251001")
agent.monitor = brain                 # one line to attach

for i, (prompt, check_fn) in enumerate(my_tasks):
    brain.set_task(i)                 # no probe_fn = kNN-only mode
    brain.reset()

    trace, tokens = agent(prompt)
    passed        = check_fn(trace)   # your existing evaluation logic

    brain.store(trace, int(passed))   # brain learns from this outcome

The brain's kNN signal starts to become meaningful around task 15–20, once it has seen a mix of passes and failures, and firms up as the store approaches the ~50 traces noted under Limitations. Probe tests (below) work immediately from task 1.

Add probe tests for maximum benefit

Probes are deterministic unit tests that run on the agent's code before the next LLM turn. They are the highest-value part of the brain and work from task 1 with no warm-up.

A good probe:

- Tests a specific edge case that the model commonly gets wrong
- Returns an empty list if the code is correct (never fire on correct code)
- Includes a FIX: clause in the failure message pointing to the exact algorithm error

def probe_sharpe(ns: dict) -> list[str]:
    """Catch the most common Sharpe ratio bug: missing sqrt(252) annualisation."""
    fn = ns.get("sharpe_ratio") or ns.get("annualized_sharpe")
    if not fn:
        return ["sharpe_ratio(returns, rf=0.02) not defined"]

    import numpy as np, pandas as pd
    np.random.seed(0)
    # Population parameters: daily mu=0.0005, vol=0.01 → annualised Sharpe ≈ 0.79
    # (the value measured on this 252-point seeded sample will differ from that)
    rets = pd.Series(np.random.normal(0.0005, 0.01, 252))
    sr = fn(rets, rf=0.0)

    if not isinstance(sr, (int, float)):
        return ["sharpe_ratio must return a scalar float"]

    if abs(sr) < 0.3:
        return [
            f"sharpe_ratio = {sr:.4f} — this looks like a daily (non-annualised) figure. "
            "FIX: Multiply by sqrt(periods): "
            "sr = (returns.mean() - rf/periods) / returns.std() * sqrt(periods) "
            "For daily data, sqrt(252) ≈ 15.87."
        ]
    return []

# register per task
brain.set_task(task_idx, probe_fn=probe_sharpe)

Probe design tips:

- Test the specific algorithm property most likely to be wrong, not the whole function
- Use a synthetic input where the correct answer is analytically known
- For numerical functions: compare to a closed-form value, not another implementation
- Make the FIX: clause algorithmic, not vague: "use np.log(p/p.shift(1))" not "check your formula"
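
These tips can be seen together in a probe whose expected value is known by hand. The sketch below is an illustration, not one of the package's probes: the function name `max_drawdown`, the price path, and the message wording are all assumptions. The path peaks at 120 and falls to 90, so the correct drawdown is 30/120 = 0.25.

```python
def probe_max_drawdown(ns: dict) -> list[str]:
    """Check max drawdown on a four-point path with a hand-computed answer of 25%."""
    fn = ns.get("max_drawdown")  # hypothetical function name
    if fn is None:
        return ["max_drawdown(prices) not defined"]
    prices = [100.0, 120.0, 90.0, 130.0]
    try:
        value = float(fn(prices))
    except Exception as exc:
        return [f"max_drawdown raised {type(exc).__name__}: {exc}"]
    # abs() accepts either sign convention; `not (... <= tol)` also rejects NaN.
    if not abs(abs(value) - 0.25) <= 1e-6:
        return [
            f"max_drawdown = {value:.4f} on [100, 120, 90, 130], expected 0.25. "
            "FIX: Measure each drop against the running peak, not the first price."
        ]
    return []


def max_drawdown(prices):
    """Correct reference: largest fractional drop from the running peak."""
    peak, worst = prices[0], 0.0
    for price in prices:
        peak = max(peak, price)
        worst = max(worst, (peak - price) / peak)
    return worst


def naive_drawdown(prices):
    """Common bug: measures the drop from the first price instead of the peak."""
    return (prices[0] - min(prices)) / prices[0]


ok = probe_max_drawdown({"max_drawdown": max_drawdown})
bad = probe_max_drawdown({"max_drawdown": naive_drawdown})
```

The correct version returns an empty list, while the naive version yields 0.10 and triggers the message with its algorithmic FIX clause.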

Track what the brain is doing

# See how often the brain fires
print(f"Brain interventions this task: {brain._code_interventions}")
print(f"Total stored trajectories:     {brain.n_stored}")

# After your loop, print a summary
for r in results:
    fires = r.get("fires", 0)
    status = "FIXED" if r.get("brain_helped") else ("FIRE" if fires else "")
    print(f"[{'✓' if r['passed'] else '✗'}] {status:5} {r['name']}")

The visualization dashboard updates live during a run:

from eval.viz_brain import BrainViz
from pathlib import Path

viz = BrainViz()

# inside your task loop, after each task:
viz.update(brain, results, fire_counts)
viz.save(Path("my_session.png"))

This produces a 4-panel dark dashboard showing: the Markov thought-state graph (nodes sized by visit count, colored by failure rate), a PCA trajectory map of all stored runs, a per-task pass/fail timeline, and the brain fire rate.

Threshold tuning

The default threshold is 0.30. Lower it if you want earlier, more aggressive intervention; raise it if the brain fires too often on passing tasks.

brain = BrainAgent(embedder, threshold=0.25)   # more aggressive
brain = BrainAgent(embedder, threshold=0.40)   # more conservative

For a new use case, start at 0.30 and observe brain._code_interventions across 10–20 tasks. If the brain fires on tasks that pass without intervention, raise the threshold. If it never fires on tasks that fail, lower it.
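
The two conditions in that rule can be turned into numbers from a session log. The sketch below assumes each record is a dict with `passed` and `fires` keys, the shape used in the session loop later in this README; `fire_diagnostics` is a hypothetical helper, and it cannot tell whether a passing task would have passed without the fire.

```python
def fire_diagnostics(results: list[dict]) -> dict:
    """Share of passing tasks the brain fired on, and share of failing tasks it stayed silent on."""
    passed = [r for r in results if r["passed"]]
    failed = [r for r in results if not r["passed"]]
    fired_on_pass = sum(1 for r in passed if r.get("fires", 0) > 0)
    silent_on_fail = sum(1 for r in failed if r.get("fires", 0) == 0)
    return {
        "fired_on_pass": fired_on_pass / len(passed) if passed else 0.0,
        "silent_on_fail": silent_on_fail / len(failed) if failed else 0.0,
    }


session = [
    {"passed": True, "fires": 0},
    {"passed": True, "fires": 1},
    {"passed": False, "fires": 0},
    {"passed": False, "fires": 2},
]
stats = fire_diagnostics(session)
```

A high `fired_on_pass` suggests raising the threshold; a high `silent_on_fail` suggests lowering it.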

Choosing what tasks to add probes for

Not every task needs a probe. Use probes where:

- There is a specific known failure mode (e.g., a particular edge case, formula direction, or off-by-one)
- The failure is deterministically testable with a small synthetic input
- Getting it wrong silently corrupts downstream tasks

Skip probes for:

- Tasks with ambiguous success criteria
- Tasks where any reasonable implementation is acceptable
- Text-generation tasks (probes only work on python_exec)

A realistic day-of-work pattern

This mirrors how the portfolio analyzer eval was run. Each task is a sequential step in a larger project; the brain accumulates signal across all of them:

from trace_use import BrainAgent, build_embedder, tool_agent, extract_code, code_judge
import json, time

def my_probe(ns: dict) -> list[str]:
    """Your deterministic edge-case test for the current task."""
    fn = ns.get("my_function")
    if not fn:
        return ["my_function not defined"]
    result = fn(known_input)
    if result != expected_output:
        return [f"Got {result}, expected {expected_output}. FIX: ..."]
    return []

def my_check(ns: dict, stdout: str) -> bool:
    """Your full pass/fail verifier — same function you'd pass to code_judge."""
    fn = ns.get("my_function")
    return bool(fn) and fn(case1) == ans1 and fn(case2) == ans2

TASKS = [
    {"name": "Step 1", "prompt": "Write my_function that ...", "probe": my_probe},
    {"name": "Step 2", "prompt": "Now extend it to handle ..."},
    # tasks in dependency order — each builds on the previous
]

brain         = BrainAgent(build_embedder(), threshold=0.30, k=5)
agent         = tool_agent(["python_exec"], max_turns=10, model="claude-haiku-4-5-20251001")
agent.monitor = brain
results       = []

for i, task in enumerate(TASKS):
    brain.set_task(i, probe_fn=task.get("probe"))
    brain.reset()
    t0 = time.time()

    trace, tokens = agent(task["prompt"])
    fires         = brain._code_interventions
    code          = extract_code(trace)            # from trace_use import extract_code

    # evaluate with code_judge or your own check function
    verifier   = code_judge(my_check)
    first_pass = verifier(task["prompt"], trace) >= 0.5

    print(f"{i+1}/{len(TASKS)} {'✓' if first_pass else '✗'}  "
          f"{task['name']}  {tokens:,} tok  {time.time()-t0:.0f}s"
          + (f"  [⚡×{fires}]" if fires else ""))

    # store first-attempt trace with first-attempt label
    brain.store(trace, int(first_pass))
    if code:
        brain.store_code(code, int(first_pass))

    results.append({"name": task["name"], "passed": first_pass, "fires": fires})

with open("my_session.json", "w") as fh:
    json.dump(results, fh, indent=2)

Live dashboard (`eval/viz_brain.py`)

BrainViz renders a 4-panel dark dashboard that updates after every task:

Panel	What it shows
Neuron Graph	k-means thought-state nodes sized by visit count, colored green→red by failure rate; transition edges weighted by frequency. Raw scatter shown before Markov activates (≥30 chunks).
Trajectory Map	PCA-2D of all stored chunk embeddings. Each completed run is a polyline: green=pass, red=fail. Clusters of failure trajectories become visible as the store fills.
Score Timeline	Per-task pass/fail bars with cumulative accuracy line.
Fire Report	Brain fires per task + cumulative fire rate vs 30% dashed reference.

Embedder

build_embedder() in agents.py prefers local and falls back to remote:

- sentence-transformers (preferred): all-MiniLM-L6-v2, 384-dim, free, ~10ms/chunk on CPU, no API key required
- OpenAI text-embedding-3-small: 1536-dim, requires OPENAI_API_KEY

Both return L2-normalised float32 vectors and are drop-in interchangeable.
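
L2 normalisation matters because the dot product of two unit vectors is their cosine similarity, so no per-query norm computation is needed. A small sketch with hypothetical helper names and toy vectors:

```python
import math


def l2_normalise(v: list[float]) -> list[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


a = l2_normalise([3.0, 4.0])  # [0.6, 0.8]
b = l2_normalise([4.0, 3.0])  # [0.8, 0.6]
similarity = dot(a, b)        # 0.48 + 0.48 = 0.96, the cosine of the angle between them
```

Vectors from the two embedders have different dimensions, so a store should hold embeddings from only one of them at a time.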

Verifiers (`pipeline.py`)

The only task-specific input to the pipeline is a Verifier: (question, answer) -> float in [0, 1].

Verifier	When to use
`code_judge(check_fn)`	Programmatic — exec the code and run your assertions
`gold_judge(gold, agent)`	Ground-truth string available
`self_judge(judge_agent)`	No ground truth — use a different model to grade
`tiered_judge(fast, strong)`	Save cost — fast model on easy cases, strong on uncertain
`self_consistency(resample, samples)`	No judge — re-run and check agreement

# code_judge: cleanest signal, use when possible
def check(ns: dict, stdout: str) -> bool:
    fn = ns.get("min_variance_portfolio")
    if not fn: return False
    import numpy as np
    cov = np.diag([0.04, 0.16])
    r = fn(cov)
    w = np.array(r["weights"]).flatten()
    return abs(sum(w) - 1.0) < 0.01 and w[0] > 0.5   # more weight on lower-var asset

verifier = code_judge(check)
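
Since a verifier is just a callable from `(question, answer)` to a score in [0, 1], a custom one takes only a few lines. The sketch below is a hypothetical substring check, not one of the package's verifiers; `contains_gold` and the `Verifier` alias are names introduced here for illustration.

```python
from typing import Callable

Verifier = Callable[[str, str], float]


def contains_gold(gold: str) -> Verifier:
    """Build a verifier that scores 1.0 when the gold string appears in the answer."""
    needle = gold.strip().lower()

    def verify(question: str, answer: str) -> float:
        return 1.0 if needle in answer.lower() else 0.0

    return verify


gold_check = contains_gold("Paris")
score = gold_check("What is the capital of France?", "ANSWER: paris")
```

The session loop in this README treats a score of 0.5 or higher as a pass, so a binary verifier like this one maps directly onto pass/fail labels.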

`run_task` reference

run_task(
    task            = "...",       # task string
    agent           = haiku,       # callable: prompt -> text or (text, tokens)
    verifier        = verifier,    # callable: (q, trace) -> float
    forecaster      = fc,          # Forecaster instance (optional)
    retriever       = retriever,   # context retriever (optional)
    threshold       = None,        # override adaptive threshold (optional)
    cap             = 8,           # max sub-questions from decompose
    display         = True,        # Rich live terminal output
    retry           = True,        # fire self-critique retry on high P(fail)
    retry_agent     = None,        # different agent for retries
    decompose_agent = None,        # different agent for decomposition
)

Returns a TaskResult with .n_pass, .n_fail, .n_intervened, .summary(), and per-component .components (each with .question, .trace, .p_fail, .label, .retried, .neighbor).

Demos

# classic AUC demos
python demo_general.py          # 40 diverse tasks, CoT haiku, AUC ~0.68
python demo_debug.py            # 29 Python debugging tasks, tool agent, AUC ~0.87
python demo_large.py            # 80+ mixed tasks, full Rich display

# brain interception evals
python eval/eval_fires.py       # 30 diverse domains, haiku, live brain dashboard
python eval/eval_hard.py        # 14 hard one-shot failures, Sonnet
python eval/eval_hard.py --haiku   # same tasks with Haiku
python eval/eval_haiku_intensive.py   # 30 tasks, haiku, intensive
python eval/eval_real_world.py  # 30 hard (segment tree, GPQA-style), haiku
python eval/eval_extensive.py   # 32 tasks, LiveCodeBench Pro / ICPC-Eval difficulty
python eval/eval_project.py     # 15-task portfolio analyzer session, haiku

Repo layout

Path	Role
`pipeline.py`	Public API: `run_task`, `decompose`, `attempt`, `Forecaster`, `make_retriever`, all verifiers
`brain.py`	`BrainAgent`, `TrajectoryStore`, `FailureStore` — inference-time failure interception
`forecast.py`	Primitives: `knn_predict`, `knn_predict_cross`, `auc`, `spearman`
`display.py`	Rich live terminal display used by `run_task`
`agents.py`	`haiku`, `opus`, `tool_agent`, `streaming_agent`, `build_embedder` (lazy clients, keys from env/`.env`)
`demo_general.py`	40 diverse tasks, CoT haiku, live plot, AUC ~0.68
`demo_debug.py`	29 Python debugging tasks, tool agent, AUC ~0.87
`demo_large.py`	80+ mixed tasks, full Rich display
`bench/`	Vendored benchmark loaders (FanOutQA, MuSiQue)
`eval/eval_fires.py`	30-task brain eval, diverse domains
`eval/eval_hard.py`	14 hard one-shot failures, Sonnet + Haiku
`eval/eval_haiku_intensive.py`	30-task intensive haiku session
`eval/eval_real_world.py`	30 hard tasks: competitive programming + GPQA-style science
`eval/eval_extensive.py`	32 tasks: LiveCodeBench Pro / ICPC-Eval difficulty + GPQA-style
`eval/eval_project.py`	15-task portfolio risk analyzer — the day-in-the-life benchmark
`eval/viz_brain.py`	4-panel live brain dashboard
`eval/results/`	All saved charts and JSON run logs
`tests/`	Offline test suite: `test_forecast.py`, `test_pipeline.py` (~150 tests, ~2s, fully stubbed)

Limitations

- Probe tests need a known failure mode. They catch localized bugs: a wrong formula, a missed edge case, a boundary condition. When the whole algorithm approach is wrong, probe feedback alone can't recover it (seen in the extensive benchmark: segment tree, Kruskal's MST).
- kNN signal needs warm-up. The trajectory and code-snippet stores need ~50 traces with mixed outcomes before predictions are reliable. Probe tests work immediately; kNN fires later as failures accumulate.
- Trace richness is required. One-liner responses produce near-identical embeddings regardless of correctness. Use a tool-calling agent or wrap any text model in a CoT prompt that forces step-by-step output.
- Verifier quality sets the ceiling. Mislabeled traces corrupt the kNN store. Prefer programmatic checks; when using an LLM judge, always use a different model than the one being evaluated.
- Brain is most impactful in the 15–40% failure band. Above ~90% pass rate, fires are rare and marginal gains are small. Below ~60%, the store fills quickly with failures but the model may need a fundamentally different approach rather than mid-turn correction.

Release history

Version	Released
0.1.1	Jun 29, 2026
0.1.0	Jun 26, 2026
