LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is

yuragi — Measure how unstable your LLM's confidence really is

Requires Python 3.11+.

Instant Demo

No API key needed:

pip install yuragi
yuragi demo


What It Does

yuragi measures confidence fragility: how much a model's certainty shifts when you rephrase the same question. It generates 13 perturbation variants of your prompt (typos, tone changes, paraphrases, authority framing), calls your model, and compares the confidence across responses. When the answer text stays the same but confidence moves, that's fragility — a property of the prompt wording, not the model's knowledge.
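
As a rough illustration of the idea, the sketch below reduces a baseline response and a few perturbed responses to (answer, confidence) pairs, then computes the mean absolute confidence shift and the share of variants whose answer is unchanged while confidence moved. The `Response` type, the `tolerance` value, and both formulas are assumptions made for this example, not yuragi's implementation; the library's actual metric definitions are in docs/theory.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    answer: str        # normalized answer text
    confidence: float  # confidence in [0, 1]


def confidence_shift(baseline: Response, variants: list[Response]) -> float:
    """Mean absolute confidence change relative to the baseline."""
    return sum(abs(v.confidence - baseline.confidence) for v in variants) / len(variants)


def dissociation_share(
    baseline: Response, variants: list[Response], tolerance: float = 0.05
) -> float:
    """Share of variants with the same answer but a confidence shift above tolerance."""
    hits = [
        v
        for v in variants
        if v.answer == baseline.answer
        and abs(v.confidence - baseline.confidence) > tolerance
    ]
    return len(hits) / len(variants)


# Made-up responses, purely for illustration.
baseline = Response("yes", 0.90)
variants = [
    Response("yes", 0.90),  # typo variant: nothing changed
    Response("yes", 0.70),  # authority framing: same answer, lower confidence
    Response("no", 0.60),   # paraphrase: the answer itself changed
    Response("yes", 0.88),  # tone change: shift stays within tolerance
]
shift = confidence_shift(baseline, variants)
share = dissociation_share(baseline, variants)
print(f"mean shift={shift:.2f}, dissociation share={share:.2f}")
```

Only the second variant counts as dissociated here: the answer matches the baseline and the confidence gap exceeds the tolerance.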

v0.5.0 also ships yuragi.guardrails — a confidence-aware multi-agent runtime with append-only audit logging, Git-like state snapshots, and AutoGen / LangGraph integrations. See the Guardrails section below.


Install

pip install yuragi

Optional extras:

pip install yuragi[viz]                  # heatmap / reliability diagram output
pip install yuragi[semantic]             # sentence-transformers for semantic entropy
pip install yuragi[stats]                # numpy/scipy for statistical tests
pip install yuragi[guardrails]           # confidence-aware LLM guardrails (stdlib only)
pip install yuragi[guardrails-autogen]   # AutoGen integration
pip install yuragi[guardrails-langgraph] # LangGraph integration
pip install yuragi[guardrails-nats]      # NATS JetStream distributed transport
pip install yuragi[all]                  # everything

Supports any litellm-compatible provider — OpenAI, Anthropic, Google, local Ollama, and 100+ others.


Python API

from yuragi import Scanner

result = Scanner(model="cerebras/llama-3.1-8b-instruct").scan("Is quantum computing practical?")
print(result.fragility_score)    # 0.056
print(result.dissociation_rate)  # 0.07 — answer same, confidence shifted

Psychology experiments / Trilayer / Semantic Entropy API

from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment

result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta)        # average confidence change
print(result.effect_confirmed) # True if max_delta >= 0.15

from yuragi.analysis.trilayer import measure_trilayer

result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence)     # Layer 1: token probability
print(result.sampling_confidence)    # Layer 2: behavioral consistency
print(result.verbalized_confidence)  # Layer 3: self-reported
print(result.internal_conflict)      # True if discrepancy > 0.2

from yuragi.metrics.semantic_entropy import semantic_entropy

h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])
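
For intuition about what a semantic-entropy score captures, here is a minimal standard-library sketch: group the samples into meaning clusters, then take the Shannon entropy of the cluster sizes. The function names and the toy `last_word_matches` equivalence check are hypothetical stand-ins; the `yuragi[semantic]` extra uses sentence-transformers, and nothing here is claimed to match the library's implementation.

```python
import math
from collections.abc import Callable


def cluster_entropy(samples: list[str], same_meaning: Callable[[str, str], bool]) -> float:
    """Shannon entropy (in nats) over greedy meaning clusters of the samples."""
    clusters: list[list[str]] = []
    for sample in samples:
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if same_meaning(cluster[0], sample):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    total = len(samples)
    return sum(-(len(c) / total) * math.log(len(c) / total) for c in clusters)


def last_word_matches(a: str, b: str) -> bool:
    """Toy stand-in for a semantic-equivalence check: compare the final word."""
    def last_word(text: str) -> str:
        return text.lower().strip(" .!?").split()[-1]

    return last_word(a) == last_word(b)


agree = cluster_entropy(["Paris", "It's Paris.", "The capital is Paris"], last_word_matches)
split = cluster_entropy(["Paris", "Lyon", "It's Paris.", "Lyon."], last_word_matches)
print(agree, split)
```

When every sample lands in one cluster the entropy is zero; an even split across two clusters gives ln 2.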

CLI Quickstart

# Scan a prompt for fragility
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct

# Find the single weakest word
yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2

# Run a psychology stress test
yuragi experiment asch --model ollama/llama3.2

Use Cases

CI/CD regression detection — catch fragility regressions before they reach production:

yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
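
The comparison behind a regression gate like this can be sketched in a few lines. The baseline layout (a JSON object mapping each prompt to a fragility score), the `find_regressions` helper, and the `tolerance` value are assumptions for illustration; the actual format of baseline.json is not documented here.

```python
import json


def find_regressions(
    baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.01
) -> dict[str, float]:
    """Return prompts whose fragility rose by more than `tolerance`, with the increase."""
    return {
        prompt: round(current[prompt] - baseline[prompt], 6)
        for prompt in baseline
        if prompt in current and current[prompt] - baseline[prompt] > tolerance
    }


# Hypothetical baseline file content and a fresh set of made-up scores.
baseline = json.loads('{"What causes inflation?": 0.040, "Is water wet?": 0.020}')
current = {"What causes inflation?": 0.090, "Is water wet?": 0.021}

regressions = find_regressions(baseline, current)
exit_code = 1 if regressions else 0  # a CI step fails on a non-zero exit code
print(regressions, exit_code)
```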

Fragility-aware routing — route each prompt to the model that answers most stably:

yuragi route "What causes inflation?" --models gpt-4o-mini,ollama/llama3.2,cerebras/llama-3.1-8b

Abstention guard, which refuses to answer unless fragility stays below the domain threshold (medical: < 0.03, safety: < 0.02):

yuragi guard "What medication should I take?" --domain medical --model gpt-4o-mini
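
The gating rule itself is small. The sketch below uses the two thresholds quoted for the guard command; the `THRESHOLDS` table and the `should_answer` helper are hypothetical names, not part of yuragi's API.

```python
# Thresholds quoted in the text; the dictionary layout is an assumption.
THRESHOLDS = {"medical": 0.03, "safety": 0.02}


def should_answer(fragility: float, domain: str) -> bool:
    """Answer only when fragility is strictly below the domain's threshold."""
    return fragility < THRESHOLDS[domain]


print(should_answer(0.025, "medical"))  # below 0.03, so answering is allowed
print(should_answer(0.025, "safety"))   # not below 0.02, so the guard abstains
```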

Model selection — find the best model for your use case by fragility profile:

yuragi recommend --use-case factual --models gpt-4o-mini,ollama/llama3.2 --budget medium

Automated red teaming — discover model weaknesses across all 13 perturbation types:

yuragi red-team prompts.txt --model gpt-4o-mini --output report.json

Status & Research Findings

yuragi is primarily a measurement and stress-testing library for confidence-under-perturbation. Earlier versions framed ensemble AUC numbers as a hallucination detector; after a multi-round internal audit (permutation + BH-FDR + length-residualization + split-conformal finite-sample CIs), most headline predictive claims did not survive multiple-testing correction or independent replication. The measurement instruments below are stable and usable; the specific predictive findings are exploratory and should be treated as hypotheses pending external replication at n≥400 per model with an independent hold-out.

Real-data empirical results on llama-3.1-8B-Instruct (Cerebras + NVIDIA NIM endpoints) and Pythia-410m, April 2026.

✅ Findings that survived multiple-testing correction (2)

  1. Null: the 13 perturbations add no predictive signal over a single baseline_confidence feature. 54 tests across 3 subsets × 2 label sources × 9 fragility features, all |z|<2; in a 60-test horizontal sweep, the BH-FDR q=0.05 survivor count is 0/60. A robust, subset-consistent null — fragility features are not worth their inference cost over just reading the model's stated confidence. Source: experiments/ablation_pivot_subset_robust_report.txt.
  2. TriviaQA / Pythia-410m, n=85: baseline_confidence AUC with split-conformal 95% CI [0.596, 0.783], not length-confounded (length-residualized Δ < 0.01). Single formal survivor of our conformal-prediction sweep. Source: experiments/uq_ensemble_sweep/verdict.md.
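
For readers unfamiliar with the BH-FDR correction referenced in these findings, the Benjamini-Hochberg procedure can be sketched as follows. This is a generic textbook version with made-up p-values, not the audit code under experiments/.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return one reject flag per hypothesis under the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff = rank
    # Every hypothesis ranked at or below the cutoff is rejected.
    reject = [False] * m
    for index in order[:cutoff]:
        reject[index] = True
    return reject


p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]  # made-up p-values
flags = benjamini_hochberg(p_values)
survivors = sum(flags)
naive = sum(p < 0.05 for p in p_values)
print(f"uncorrected p<0.05: {naive}, BH-FDR survivors at q=0.05: {survivors}")
```

In this toy input four tests pass an uncorrected 0.05 cut, but only two survive the correction, which is the kind of shrinkage the survivor counts above describe.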

⚠️ Exploratory — do not rely on these as validated claims

  • Ensemble AUC 0.73 on TruthfulQA n=412 / 105 features. Wilson CI [0.678, 0.779] does not cross 0.5, but the finding was not pre-registered, there is no independent hold-out, and a scaled-down replication (n=100, 4-feature ensemble across 3 independent datasets) produced OOF AUCs below the best single feature in all 3, with BH-FDR q=0.05 yielding a single survivor that is single-feature bc, not the ensemble. Adding the 13 perturbation features over a no-perturbation baseline produces Δ=−0.027, 95% CI [−0.085, +0.035], p=0.35 — not statistically significant. On is_correct (flexible-judge) labels, perturbations in fact reduce AUC significantly (Cerebras n=382: Δ=−0.05 p=0.012). Treat as a hypothesis pending n≥400 replication with an independent hold-out. Sources: experiments/ensemble_final.txt, experiments/ablation_delta_significance_report.txt.
  • Confidence sign-inversion on 8B. Higher self-reported confidence correlates with higher hallucination probability (TriviaQA n=200: raw AUC 0.252, which becomes 0.748 once the score is inverted). Length-residualized AUC falls to 0.612. Single provider family, single hardware pair, no cross-family replication at n≥400.
  • Multi-judge majority n=200 AUC 0.635 — subset-artifact: on the rest-212 complement the same method drops to 0.502 (chance), permutation p=0.002. Retracted as a standalone claim.
  • Single-signal solo AUC across 6 datasets (TruthfulQA, TriviaQA, NQ-Open, NIM, Cohere, Mistral) is ~0.50 for fragility_score; the earlier "0.62 noise floor" claim is retracted.
  • Fragility scaling trend F(N)=a/√N+b (R²=0.987) — 5-model curve fit, no multiple-testing correction, no hold-out, exploratory.
  • Activation-patching L12–L13 "double dissociation" (paper Contribution 2) — rescued with an L23-control experiment at n=10 on Pythia-410m; an independent causal-tracing sweep on the same prompts places the per-prompt peak layer across L2–L17 (mean 7.5, std 4.3; only 1/10 near L12–L13). Patching-level dissociation stands; the stronger "the circuit lives at L12–L13" reading does not survive an independent method at this n.
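
On the sign-inversion bullet: the move from a raw AUC of 0.252 to an inverted 0.748 is arithmetic rather than a second result, because negating a score maps AUC to 1 − AUC. A small pairwise AUC on made-up data shows this; the `auc` helper is written for this example and is not taken from the yuragi codebase.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up data in which higher confidence goes with wrong answers.
confidence = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
correct = [0, 0, 1, 0, 1, 1]

raw = auc(confidence, correct)
inverted = auc([-c for c in confidence], correct)
print(f"raw={raw:.3f}, inverted={inverted:.3f}, sum={raw + inverted:.3f}")
```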

🧪 Measurement reliability (what the library is good at today)

Test–retest Pearson correlation on paired scans (same prompt, different seed):

| Signal | r | Recommendation |
| --- | --- | --- |
| baseline_confidence | 0.88 | ✓ Primary |
| paraphrase_fragility | 0.80 | ✓ Primary |
| adaptive_fragility | 0.78 | ✓ Primary |
| impostor_fragility | 0.70 | ○ Supporting |
| fragility_score (aggregate) | 0.64 | ○ Supporting |
| counterfactual_fragility | 0.18 | ✗ Noise-dominated, do not use |

The perturbation-and-confidence measurement suite is stable. Whether fragility predicts hallucination is open and strongly dataset-dependent: it works on single-path factoids ("Who discovered argon?" → AUC ~0.75) and fails on imitative-falsehood benchmarks ("What happens if you break a mirror?" → AUC ~0.50). See paper/domain_boundary_section.md.

Supporting observation: when answer text is identical (Jaccard=1.0), max confidence shift is 0.021 (below the noise floor); when text differs, confidence shifts up to 0.528. This is evidence that verbalised confidence tracks surface text more than underlying knowledge. See RESEARCH.md.
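
The Jaccard=1.0 condition can be made concrete with a short sketch. Word-level tokenization on whitespace and lowercasing are assumptions here; the text does not say how yuragi tokenizes answers before comparing them.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two answers."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty answers are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


same = jaccard("Paris is the capital", "paris is the capital")
partial = jaccard("Paris is the capital", "The capital is Lyon")
print(same, partial)
```

The first pair shares every word and scores 1.0; the second shares three of five distinct words and scores 0.6.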

📉 Methodological caveats

  • 20+ post-hoc audits on the same dataset ⇒ p-hacking risk; multiple-testing correction was not applied across audits.
  • Single base model (Cerebras Llama-3.1-8B) for most results; n=9 frontier pilots underpowered.
  • No pre-registration, no independent hold-out, no 3-seed replication.
  • Length bias (Spearman ρ=+0.35, longer answers graded more leniently) is partially entangled with fragility_score (Δ=+0.022 after length-residualization); prefer length-residualized AUC.
  • Public benchmarks with higher AUC exist (SSP ~0.786 output-side, LSD ~0.96 activation-side). Output-level methods like yuragi are bounded by the mutual information I(correct; h_internal) accessible from the output surface.

See KNOWN_LIMITATIONS.md and experiments/ for 20+ raw audit reports.


Integration

pandas — score a DataFrame of prompts:

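import pandas as pd
from yuragi import Scanner

# Example input: one prompt per row
df = pd.DataFrame({"prompt": ["Is quantum computing practical?", "What causes inflation?"]})

scanner = Scanner(model="gpt-4o-mini")
df["fragility"] = df["prompt"].apply(lambda p: scanner.scan(p).fragility_score)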

pytest — assert stability in tests:

from yuragi import Scanner

def test_prompt_stability():
    result = Scanner(model="gpt-4o-mini").scan("What is the capital of France?")
    assert result.fragility_score < 0.05, f"Fragility too high: {result.fragility_score}"

GitHub Actions — CI/CD fragility gate:

- name: Check fragility regression
  run: yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini

A reusable GitHub Actions workflow is included.


Guardrails (v0.5.0)

yuragi.guardrails is an opt-in subpackage that turns yuragi from a measurement library into a confidence-aware LLM guardrail platform. It is shipped inside the yuragi wheel — no extra install needed for the core — and adds zero runtime dependencies (only the standard library).

import asyncio

from yuragi.guardrails import (
    AuditLog,
    ConfidencePolicy,
    ConfidenceReport,
    Runtime,
    PlannerAgent, ExecutorAgent, CriticAgent,
    ResearcherAgent, VerifierAgent,
)


async def main() -> None:
    # 1. Append-only audit log with SHA-256 hash chain
    log = AuditLog("./audit.db")

    # 2. A multi-agent mesh with confidence-aware routing
    async with Runtime(audit_log=log) as rt:
        await rt.spawn(PlannerAgent, name="planner")
        await rt.spawn(ExecutorAgent, name="executor")
        await rt.spawn(CriticAgent, name="critic", policy=ConfidencePolicy(tau=0.85))
        await rt.spawn(ResearcherAgent, name="researcher")
        await rt.spawn(VerifierAgent, name="verifier")
        await rt.publish("planner", {"task": "summarise quantum tunneling", "complexity": 6})

    # 3. Verify nobody tampered with the audit trail later
    assert await log.verify_chain()


# `async with` and `await` need a running event loop, so run from a coroutine
asyncio.run(main())
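
To show why a SHA-256 hash chain makes an audit trail tamper-evident, here is a minimal in-memory sketch: each entry stores the hash of its predecessor, so editing any earlier entry breaks verification. The entry layout, the `GENESIS` constant, and the helper names are assumptions for this example; they say nothing about how `AuditLog` stores or verifies its records.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    """Append an entry that is linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(chain: list[dict]) -> bool:
    """Recompute every hash and check that each entry links to its predecessor."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain: list[dict] = []
append(chain, {"agent": "planner", "event": "task_received"})
append(chain, {"agent": "critic", "event": "approved", "confidence": 0.91})

intact = verify(chain)
chain[0]["payload"]["event"] = "task_dropped"  # simulate tampering with history
tampered = verify(chain)
print(intact, tampered)
```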

Differentiators against existing OSS guardrails:

Feature NeMo Guardrails Guardrails AI Llama Guard LangKit yuragi.guardrails
Confidence-aware routing fused 4-signal score
Tamper-evident audit log SHA-256 hash chain
Crash-resume snapshots Merkle DAG, ≤ 1 s target
Public benchmarks partial exploratory (see Status & Research §)

Framework integrations (each behind an extras gate so the core stays light):

from yuragi.guardrails.integrations.autogen import AutoGenGuardrail   # pip install yuragi[guardrails-autogen]
from yuragi.guardrails.integrations.langgraph import guardrail_node   # pip install yuragi[guardrails-langgraph]

The runtime ships with InMemoryTransport by default; for distributed deployments install yuragi[guardrails-nats] and pass NatsTransport(...) instead. NATS support is experimental in v0.5.0 — see KNOWN_LIMITATIONS.md (G1–G3) before relying on it in production.

A complete demo lives at examples/guardrails_smoke.py.


Full CLI Reference

All 19 commands
| Command | Description |
| --- | --- |
| demo | Run pre-computed demo (no API key needed) |
| scan | Full fragility scan (13 perturbation types) |
| find-weakness | Find the single word that most collapses confidence |
| experiment | Run a psychology template (11 types) |
| compare-models | Multi-model fragility comparison with heatmap |
| check | CI/CD fragility regression detection |
| route | Fragility-aware multi-model routing |
| guard | Abstention system for high-stakes domains |
| recommend | Model selection based on fragility profiles |
| red-team | Automated vulnerability discovery |
| trajectory | Track confidence across a prompt sequence |
| stats | Statistical analysis (Cohen's d, Wilcoxon, bootstrap CI) |
| trilayer | Measure confidence via 3 simultaneous methods |
| profile | Fragility profile: CCI / RE / NLS |
| linguistic | Analyze linguistic confidence markers (hedges, assertiveness) |
| volatility | Financial-engineering metrics (VIX, Sharpe ratio) for confidence |
| phase-map | Map confidence phase transitions across parameter space |
| compare | Compare two scan results (A/B test) |
| export | Export scan results to CSV/JSON |

Research

Key discoveries, empirical data, and scaling trends: RESEARCH.md

White-box layer entropy experiments:

python experiments/whitebox_design.py --exp entropy_trajectory
python experiments/whitebox_design.py --exp critical_layer_heatmap
python experiments/whitebox_design.py --exp cpu  # lightweight demo

See also docs/related_work.md for comparison with lm-polygraph, SelfCheckGPT, PromptBench, CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS.


Papers

ICML 2026 MI Workshop (submission target 2026-05-07):

"From Black-Box Fragility to White-Box Dynamics: Layer-Resolved Entropy Signatures of Confidence Perturbation in LLMs"

Source: paper/icml2026_mi/. Three contributions, all exploratory at current sample sizes (see Status & Research § for caveats): (1) conditional two-phase entropy signature with an L12–L13 activation-patching dissociation rescued via an L23-control at n=10; an independent causal-tracing sweep on the same prompts does not localise to L12–L13, (2) perturbation-type-specific layer pathways, (3) confidence stability scaling trend F(N) = a/√N + b (R²=0.987 over 5 models, no multiple-testing correction, no hold-out).

EMNLP ARR 2026 (in preparation, target 2026-05-25):

"Intent-Misalignment Hallucination: Perturbation-Driven Detection of Specification-Ignored LLM Generation"

Outline: paper/emnlp2026_intent/OUTLINE.md. Introduces intent-misalignment hallucination — outputs that are syntactically correct and instruction-compliant yet ignore per-user project context — and proposes context-stripping perturbation (CSP) for detection. Seed dataset (30 tasks × 3 ecosystems) at seed_tasks.jsonl.

Citation

@misc{yuragi2026,
  title  = {yuragi: Confidence Fragility in Neural Networks},
  author = {hinanohart},
  year   = {2026},
  url    = {https://github.com/hinanohart/yuragi}
}

Contributing / License

Issues and PRs welcome. See CONTRIBUTING.md.

Known limitations: KNOWN_LIMITATIONS.md. Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.

Apache License 2.0 for human use.

AI / ML training opt-out

This repository is opted out of AI/ML training, fine-tuning, evaluation, and embedding generation. See ai.txt. Using this work to train machine-learning models without separately negotiated written permission is explicitly disallowed. The Apache License 2.0 covers human use and software redistribution; it does not grant a training data license.
