Failure attribution for agent pipelines — given an AgentTrace and a score, Origin finds which node(s) caused the failure.

aevyra-origin

When an agent fails, the cause is rarely obvious. Origin takes the trace of what ran, the score of how it did, and a rubric of what good looks like — and tells you which span failed, why, and what kind of fix it needs.

Witness  →  captures what happened         (aevyra-witness)
Verdict  →  judges it                      (aevyra-verdict)
Origin   →  finds where it went wrong      (you are here)
           └─ fix_type="prompt"?          → Reflex  (aevyra-reflex)
           └─ fix_type="retrieval"?       → fix the index
           └─ fix_type="tool_schema"?     → fix the schema
           └─ fix_type="routing"?         → fix the router
           └─ fix_type="infrastructure"?  → fix ops
flowchart LR
    TR[AgentTrace\nfrom Witness]:::data
    SC[score + rubric\nany scorer]:::data

    CR[critic\n1 LLM call]:::method
    DC[decomposition\n1 LLM call]:::method
    AB[ablation\nreplay runner]:::method

    MG([merge +\ncorroborate]):::origin

    PR[fix_type=prompt\n→ Reflex]:::prompt
    OT[retrieval · routing\ntool_schema · infra\n→ targeted fix]:::other

    TR & SC --> CR & DC & AB
    CR & DC & AB --> MG
    MG --> PR
    MG --> OT

    classDef data    fill:#6E3FF3,color:#fff,stroke:none
    classDef method  fill:#9B6BFF,color:#fff,stroke:none
    classDef origin  fill:#3FBFFF,color:#fff,stroke:none
    classDef prompt  fill:#2ECC71,color:#fff,stroke:none
    classDef other   fill:#444,color:#fff,stroke:none

Origin takes a score from any source — Verdict, a custom function, or a plain lambda. Verdict is the recommended path but not required.

Use cases

  • Debugging a failing agent — know whether the planner, a retrieval step, or a tool call caused the bad output, without adding print statements or re-running manually.
  • Prioritising fixes — not all failures are prompt failures. Origin tells you whether to rewrite a prompt, fix a retrieval index, or correct a tool schema before you spend time on the wrong thing.
  • Routing to Reflex — when fix_type="prompt", hand the attribution directly to Reflex for automated prompt repair. Origin's by_prompt() gives Reflex exactly the prompt-level view it needs.

Works with any LLM — Claude, OpenAI, OpenRouter, local Ollama or vLLM, or any OpenAI-compatible endpoint.

Install

pip install aevyra-origin               # Claude included by default
pip install aevyra-origin[openai]       # add OpenAI, OpenRouter, Together, Groq, Ollama
pip install aevyra-origin[all]          # everything

Python 3.10+.

Provider        Extra        Env var
Anthropic       (included)   ANTHROPIC_API_KEY
OpenAI          [openai]     OPENAI_API_KEY
OpenRouter      [openai]     OPENROUTER_API_KEY
Together AI     [openai]     TOGETHER_API_KEY
Groq            [openai]     GROQ_API_KEY
Ollama          [openai]     (none)

Quick start

Instrument your pipeline with @span, hand Origin a rubric and a judge, get back an attribution:

from aevyra_witness.runtime import span
from aevyra_origin import diagnose_pipeline
from aevyra_origin.llm import anthropic_llm
from aevyra_origin.judges import judge_from_verdict
from aevyra_verdict import LLMJudge
from aevyra_verdict.providers import get_provider

@span("classify")
def classify(text): ...

@span("retrieve")
def retrieve(topic): ...

@span("answer", optimize=True, prompt_id="answer_v1")
def answer(q, docs): ...

def my_agent(q):
    topic = classify(q)
    return answer(q, retrieve(topic))

judge = judge_from_verdict(LLMJudge(judge_provider=get_provider("anthropic")))

result = diagnose_pipeline(
    my_agent, "I was charged twice — how do I get a refund?",
    judge=judge,
    rubric="Accurate, grounded in the policy docs, and addresses the user's concern.",
    llm=anthropic_llm(),
)

print(result.render())

diagnose_pipeline runs your pipeline under a tracer, scores the captured trace, and invokes the attribution engine — all in one call. result.render() prints something like:

Origin attribution  (method=all, score=0.31)
  Summary: The retrieve span failed to surface the refund policy document,
  leaving the answer span without the grounding it needed. The classify
  span contributed by routing to the wrong topic, narrowing the retrieval
  scope before it even ran.

  1. retrieve (id=n2)  [primary, confidence=0.89, fix=retrieval]
     Returned generic FAQ results; the refund policy doc was not in the
     retrieved set despite being present in the index.

  2. classify (id=n1)  [contributing, confidence=0.44, fix=routing]
     Classified as "billing/general" rather than "billing/refund",
     causing the retriever to miss the policy-specific corpus.

  3. answer (id=n3)  [minor, confidence=0.18, fix=prompt]
     Given the missing context, the answer defaulted to a generic
     apology rather than citing the 30-day refund window.

  --- Prompt-level rollup (for Reflex) ---
  prompt=answer_v1  [minor, confidence=0.18, spans=1]

The fix_type tells you where to direct the repair effort. Only spans with fix_type="prompt" are candidates for Reflex; the others need a different intervention.
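
For example, a minimal sketch of pulling the Reflex-eligible spans out of the result above (assumes the result object from the quick start):

# Only prompt-type culprits are candidates for automated prompt repair.
reflex_candidates = [c for c in result.culprits if c.fix_type == "prompt"]
for c in reflex_candidates:
    print(c.prompt_id, c.severity, c.confidence)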

Don't have a Verdict metric? Pass any Callable[[AgentTrace], float] as judge= — including a lambda that wraps your own evaluator.
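
A minimal sketch of such a judge, reusing my_agent and anthropic_llm from the quick start; keyword_judge and its str(trace) check are stand-ins for your own evaluator, not part of Origin's API:

from aevyra_witness import AgentTrace

# Any Callable[[AgentTrace], float] satisfies judge=.
def keyword_judge(trace: AgentTrace) -> float:
    # Naive stand-in: reward traces whose repr mentions "refund".
    return 1.0 if "refund" in str(trace) else 0.0

result = diagnose_pipeline(
    my_agent, "how do I refund?",
    judge=keyword_judge,
    rubric="Accurate and addresses the user's concern.",
    llm=anthropic_llm(),
)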

Three on-ramps

The turnkey path is the recommended starting point, but Origin's attribution engine works with any trace you can produce:

  1. Turnkey — give Origin your pipeline and it handles tracing + scoring: diagnose_pipeline(pipeline, input, judge, rubric, llm). Your pipeline just needs @span decorators from aevyra_witness.runtime.

  2. Adapter — if you already emit framework logs (OpenClaw JSONL today; LangSmith, OTel, and other adapters to follow), parse them into an AgentTrace and hand it to Origin:

    from aevyra_witness.adapters import from_openclaw_jsonl

    trace = from_openclaw_jsonl(log_lines)
    origin.diagnose(trace=trace, score=0.4, rubric=...)   # origin as constructed in the raw on-ramp below
    
  3. Raw — you already have an AgentTrace and a score:

    from aevyra_origin import Origin
    from aevyra_origin.llm import anthropic_llm

    origin = Origin(llm=anthropic_llm())
    result = origin.diagnose(trace=my_trace, score=0.4, rubric=...)
    

What Origin diagnoses

Not all agent failures are prompt failures. Origin classifies each culprit span into one of six fix types:

fix_type         What it means                                                     Who fixes it
prompt           The instructions or context in the prompt need changing           Reflex
tool_schema      The tool's input schema is ambiguous; the LLM called it wrong     Schema redesign
retrieval        The retrieval step fetched wrong, irrelevant, or missing docs     Index / embedding fix
routing          The pipeline sent the query down the wrong branch or tool         Routing logic fix
infrastructure   A transient or systemic issue: timeout, rate limit, auth error    Ops / infra fix
unknown          Origin could not determine the fix type                           Manual review

This matters because Reflex can only help with fix_type="prompt". When Origin tells you the problem is in the retrieval index or the tool schema, you know immediately where to look — and that rewriting the prompt won't help.
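
A minimal sketch of acting on that classification (assumes a completed result from diagnose_pipeline):

culprit = result.top_culprit()
if culprit is None:
    print("no culprit identified; manual review")
elif culprit.fix_type == "prompt":
    # Only this branch is a Reflex candidate.
    print(f"hand prompt {culprit.prompt_id} to Reflex")
else:
    # retrieval / routing / tool_schema / infrastructure / unknown
    print(f"{culprit.node_name} needs a {culprit.fix_type} fix, not a prompt rewrite")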

Methods

Origin ships three attribution methods that can be run individually or combined.

LLM-as-critic (method="critic") makes one LLM call. The LLM reads the rubric, score, and full trace, and returns a ranked list of culprit spans with severity, confidence, reasoning, and fix_type. Fast and general — works for any rubric. Best for single-cause failures.

Score decomposition (method="decomposition") also makes one LLM call, but approaches it differently. The LLM enumerates the rubric's underlying criteria, attributes each criterion to the span(s) responsible, and aggregates per-span blame across failed criteria. Better at surfacing distributed failures where multiple steps each contributed.

Ablation (method="ablation") is the causal method. For each candidate span, it replaces the span's output with a neutral placeholder, re-runs the pipeline via a user-supplied runner, and re-scores via the judge. It's the only method that makes a causal claim — a large score delta means the span is genuinely responsible. Requires a deterministic runner.

method="all" runs all available methods and merges the results. The two LLM methods always run (two LLM calls total). Ablation participates when a runner is supplied; otherwise it's silently skipped. Spans flagged by multiple methods receive a corroboration bonus. fix_type is resolved to the most specific type across methods ("retrieval" wins over "unknown").

Ablation quick start

from aevyra_origin import diagnose_pipeline
from aevyra_witness import AgentTrace

def my_runner(trace: AgentTrace, overrides: dict) -> AgentTrace:
    # Replay the pipeline with overrides[span_id] forced as the output.
    # LLM calls should be cached or mocked for determinism.
    ...

result = diagnose_pipeline(
    my_agent, "how do I refund?",
    judge=judge, rubric=rubric, llm=anthropic_llm(),
    runner=my_runner,
    method="all",
)

Ablation cost control: ablation_budget=N caps total runs. The raw on-ramp exposes candidates=["span_a", "span_b"] to limit the sweep to specific span ids.
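
A sketch of capping the sweep on the turnkey path, reusing the judge, rubric, and runner from the example above:

result = diagnose_pipeline(
    my_agent, "how do I refund?",
    judge=judge, rubric=rubric, llm=anthropic_llm(),
    runner=my_runner,
    method="all",
    ablation_budget=3,   # at most three ablation replay + re-score runs
)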

API

diagnose_pipeline(...) → Attribution

result = diagnose_pipeline(
    pipeline, *args,                  # your callable + whatever it takes
    judge=...,                        # Callable[[AgentTrace], float]
    rubric=...,                       # str
    llm=...,                          # Callable[[str], str]
    ideal=None,                       # optional reference output
    trace_metadata=None,              # dict, stored on the trace
    method="all",                     # "critic" | "decomposition" | "ablation" | "all"
    runner=None,                      # ablation replay function (enables ablation)
    ablation_placeholder="null",      # "null" or "ideal"
    ablation_budget=None,             # cap ablation runs
    **kwargs,                         # forwarded to your pipeline
)

Origin.diagnose(...) → Attribution

result = origin.diagnose(
    trace=...,                        # AgentTrace
    score=...,                        # float, the judge score being explained
    rubric=...,                       # str
    method="all",                     # "critic" | "decomposition" | "ablation" | "all"
    ablation_placeholder="null",
    ablation_budget=None,
)

Attribution

result.summary              # str — one-paragraph overview
result.culprits             # list[NodeAttribution], sorted by confidence desc
result.method               # "critic" | "decomposition" | "ablation" | "all"
result.score                # float — the judge score
result.raw                  # dict — pipeline_output, captured_trace, method-level raw outputs

result.top_culprit()        # NodeAttribution | None
result.primary_culprits()   # [NodeAttribution] — severity="primary"
result.by_prompt()          # [PromptAttribution] — roll blame up to prompt_id for Reflex
result.render()             # str — multi-line CLI rendering
result.to_json(indent=2)    # str — JSON serialization
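
For instance, a small sketch of persisting a result and listing its primary culprits:

# Write the full attribution to disk, then print the primary findings.
with open("attribution.json", "w") as f:
    f.write(result.to_json(indent=2))

for c in result.primary_culprits():
    print(c.node_name, c.fix_type, c.reasoning)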

NodeAttribution

c.node_name                 # str — matches a name in the trace
c.severity                  # "primary" | "contributing" | "minor"
c.confidence                # float in [0, 1]
c.reasoning                 # str — grounded in the trace
c.node_id                   # str | None — span id (required for DAG traces with repeated names)
c.prompt_id                 # str | None — prompt identity; used by by_prompt() rollup
c.fix_type                  # "prompt" | "tool_schema" | "retrieval" | "routing"
                            #           | "infrastructure" | "unknown"

Attribution.by_prompt() → list[PromptAttribution]

For DAG traces where the same prompt fires at many call sites, by_prompt() rolls span-level blame up to the prompt level — mean confidence across spans sharing a prompt_id, max severity, concatenated reasoning. Only culprits with fix_type="prompt" are meaningful inputs to Reflex.

for pa in result.by_prompt():
    print(pa.prompt_id, pa.severity, pa.confidence)

judge_from_verdict(metric, *, ...) → Judge

Adapts any Verdict Metric (LLMJudge, ExactMatch, BleuScore, RougeScore, custom metrics) to Origin's Callable[[AgentTrace], float] contract. Duck-typed — no hard Verdict dependency.

from aevyra_origin.judges import judge_from_verdict
from aevyra_verdict import LLMJudge

judge = judge_from_verdict(LLMJudge(judge_provider=provider))

Customize what gets fed to the metric with extract_response=... and extract_messages=... when the defaults aren't right for your pipeline.
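
A sketch of overriding the response extraction; the extractor body is a placeholder, and where your final answer lives on the AgentTrace is an assumption you will need to adapt:

def extract_final_answer(trace) -> str:
    # Placeholder: replace with logic that pulls your pipeline's final
    # answer text out of the AgentTrace.
    return str(trace)

judge = judge_from_verdict(
    LLMJudge(judge_provider=provider),
    extract_response=extract_final_answer,
)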

CLI

For pre-captured traces (the raw on-ramp):

aevyra-origin diagnose trace.json \
  --score 0.4 \
  --rubric rubric.txt \
  --model anthropic/claude-sonnet-4-5 \
  --method all \
  --output result.json     # optional — writes full Attribution JSON

Passing --rubric - reads the rubric from stdin. --model follows the same provider/model convention as aevyra-reflex — openrouter/qwen/qwen3-8b, openai/gpt-4o, ollama/qwen3:8b. The rendered report (including the prompt-level rollup for Reflex) always goes to stdout.

Interop with Reflex

by_prompt() on the result gives Reflex the prompt-level view it needs. Only culprits with fix_type="prompt" are handed to Reflex — the others (retrieval, routing, infrastructure, tool_schema) need a different repair.

# What Reflex consumes:
for pa in result.by_prompt():
    print(pa.prompt_id, pa.severity, pa.confidence)

# Wire Origin's LLM to Reflex's LLM type:
from aevyra_reflex import LLM
from aevyra_origin.llm import LLMFn

reflex_llm = LLM(model="claude-sonnet-4-5")
llm: LLMFn = lambda p: reflex_llm.generate(p, temperature=0.0)

License

Apache-2.0.
