Skip to main content

Diagnose how your agent's compliance with rules varies across trajectory length.

Project description

halftrace

Diagnose how your agent's compliance with rules varies across trajectory length.

CI PyPI Python License

halftrace is a diagnostic instrument for agent trajectories. Point it at your existing OpenAI, Anthropic, or LangSmith logs and it tells you, per failure mode:

  • what shape of compliance your agent has (perfect / abandoned / bimodal / categorical / gradient),
  • why that shape is likely showing up, and
  • what to try next based on patterns we observed in twelve API pilots.

No new API calls required. No new task harness to wire up. You bring the logs; halftrace does the analysis.

What the empirical record says

Halftrace was originally designed to measure gradual decay of agent capabilities over long trajectories — a concept inherited from the "context rot" intuition. Twelve pilot runs (~64 trajectories, $8.77 of spend, see RESULTS.md) produced a different picture:

Modern Claude (Sonnet 4.6, Haiku 4.5) does not decay gradually on simple agentic tasks at N up to 200. Compliance is categorical per turn-type and bimodal per trajectory — the agent either commits to a rule for the whole trajectory or abandons it from turn 4 onward, with the choice approximately a coinflip.

Three rule designs, two task variants, two models, N from 5 to 200, zero instances of gradient decay on any probe. The original halftrace scalar (the trajectory length at which compliance crosses 50%) is undefined for the shapes we actually observe, so halftrace now reports the shape of the compliance pattern and a commit_probability (fraction of trajectories that follow the rule end-to-end) as the headline metrics. Halftrace-the-scalar is still reported when the shape is gradient — it just doesn't fire on the data we have.

Three workflows

Analyse existing logs

Most agent developers already have trajectory logs. Point halftrace at them:

halftrace analyse --input my_logs.jsonl --format openai

Each line of my_logs.jsonl is one trajectory as an OpenAI chat-completions payload (or an Anthropic messages.create() payload with --format anthropic, or a LangSmith run-tree dict with --format langsmith). The tool ingests each, scores every probe, and prints a profile per probe:

Analysing 124 trajectories from my_logs.jsonl (openai format)

state_amnesia: shape=perfect  commit_p=1.00
instruction_decay: shape=bimodal  commit_p=0.52
  why:  Agent commits-or-abandons per trajectory: the choice is approximately
        a coinflip and is stable for the rest of the trajectory once made...
  try:
    - Add a worked example response in the system prompt...
    - Restate the rule in the initial user message...
    - Consider relaxing the rule on turn 1...
tool_repetition: shape=perfect  commit_p=1.00
narration_substitution: shape=perfect  commit_p=1.00

Compare before and after a prompt change

The iterative prompt-engineering workflow:

halftrace compare --before before_logs.jsonl --after after_logs.jsonl --format openai
Comparing 50 before-trajectories vs 50 after-trajectories (openai format)

[+] instruction_decay: shape=bimodal c=0.52 → shape=perfect c=0.96  (improved)  Δcommit=+0.44  shape: bimodal → perfect
[ ] narration_substitution: shape=perfect c=1.00 → shape=perfect c=1.00  (unchanged)
[ ] state_amnesia: shape=perfect c=1.00 → shape=perfect c=1.00  (unchanged)
[ ] tool_repetition: shape=perfect c=1.00 → shape=perfect c=1.00  (unchanged)

No API spend; the comparison runs entirely over your existing logs.

Run new trajectories against the Anthropic API

If you want a controlled experiment rather than working from production logs:

halftrace pilot --n 5 10 25 --reps 3 --serial

This drives a built-in synthetic task (find_and_synthesise) through Claude at varying trajectory lengths, scores every probe, and emits a profile. See halftrace pilot --help for the model, plant-count, and discovery-mode flags. ANTHROPIC_API_KEY must be set.

What halftrace measures

Four probes ship in the box. Each scores a different failure mode per trajectory:

Probe What it measures
state_amnesia Retention of facts planted earlier in the trajectory
instruction_decay Adherence to a system-prompt rule over time
tool_repetition Avoidance of re-calling tools with identical arguments
narration_substitution Emitting tool calls rather than just describing them

The pilot phase also drafted premature_termination (declaring the task done before all expected work) — it's not yet implemented.

For each probe, halftrace classifies the shape of the per-trajectory score distribution:

Shape When it fires Headline metric
perfect All trajectories score ≥ 0.95 commit_probability
abandoned All trajectories score ≤ 0.05 commit_probability
bimodal High within-cell variance — coinflip per trajectory commit_probability
categorical Stable intermediate compliance — agent applies the rule to one turn-type and drops it on another commit_probability
gradient Monotone decreasing means — the case the original halftrace concept assumed halftrace (also commit_probability)
unclassified None of the above; usually means more reps or wider N needed commit_probability

Each non-perfect shape comes with a one-line cause and 2–3 concrete suggestions drawn from the empirical patterns in RESULTS.md.

Custom probes

A probe is a function Trajectory -> Score. The four shipped probes are 50-100 line files; you can drop in your own:

from halftrace import Score, Trajectory


def too_many_apologies(trajectory: Trajectory) -> Score:
    """Score the fraction of assistant turns that DON'T start with an apology."""
    text_turns = [
        t for t in trajectory.turns
        if t.role == "assistant" and t.content
    ]
    if not text_turns:
        return Score(probe="too_many_apologies", value=None, n_observations=0)
    bad = sum(
        1 for t in text_turns
        if t.content is not None and t.content.lower().startswith("i apologise")
    )
    return Score(
        probe="too_many_apologies",
        value=(len(text_turns) - bad) / len(text_turns),
        n_observations=len(text_turns),
    )

Then score it across your logs and analyse the shape:

from halftrace import from_openai_messages, analyse_compliance, diagnose

trajectories = [from_openai_messages(payload) for payload in load_my_logs()]
scores = [too_many_apologies(t).value for t in trajectories]
scores = [s for s in scores if s is not None]

profile = analyse_compliance({0: scores}, probe="too_many_apologies")
print(f"{profile.shape}: {profile.commit_probability:.2f}")
print(diagnose(profile).cause)

The corresponding ingest functions for the other two formats are from_anthropic_messages and from_langsmith_run — same shape, one trajectory per call.

Probes that need configuration (e.g. instruction_decay's rule) read it from Trajectory.metadata. Tasks set this metadata; ingested trajectories receive it from the source payload's optional metadata field.

What this isn't

  • Not a benchmark. No leaderboard, no canonical task set. The shape classifier characterises your agent on your logs.
  • Not an eval framework. If you want grading, scoring rubrics, or production observability, use Inspect, Braintrust, or LangSmith. halftrace is a diagnostic instrument that sits alongside them.
  • Not an agent framework. It doesn't build agents. It diagnoses agents you've already built.

Install

pip install halftrace

For the pilot subcommand against Claude:

pip install "halftrace[anthropic]"

Optional extras:

pip install "halftrace[openai]"  # not yet wired into the runner; reserved for future
pip install "halftrace[all]"

Requires Python 3.11+.

Reading order

  • README.md — this file: positioning, workflows, custom probes.
  • RESULTS.md — pilot-phase findings, cost ledger, the "modern Claude commits or doesn't" claim with supporting data.
  • HYPOTHESES.md — the original pre-registered design (assumes gradient decay; preserved for transparency about what we expected vs. what we found).

License

MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

halftrace-0.1.2.tar.gz (141.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

halftrace-0.1.2-py3-none-any.whl (48.4 kB view details)

Uploaded Python 3

File details

Details for the file halftrace-0.1.2.tar.gz.

File metadata

  • Download URL: halftrace-0.1.2.tar.gz
  • Upload date:
  • Size: 141.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for halftrace-0.1.2.tar.gz
Algorithm Hash digest
SHA256 33ccd51ff6006e2d02535ab7c05cdea30e6686146230e9f6d05465c90057e655
MD5 a43f8592163d3a709aa1842e52ce642f
BLAKE2b-256 4e465268d4de8f3259b24df368ee0a925a6132e93e958e88f37706833960912d

See more details on using hashes here.

File details

Details for the file halftrace-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: halftrace-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 48.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for halftrace-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 1c13bc5f02f9c0638443f8ff09e4cb84a62cba99010e4e046b4b32b11f3821e7
MD5 3fb91008c41c665d4cb1ecb3212484d9
BLAKE2b-256 c81009841fd9fe179b49c4888472db7b8f376ecf27f438999af65a6f3c8e3767

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page