LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is
Reason this release was yanked:
0.5.1 includes security fixes; earlier sdists leaked research artifacts and local paths. Please upgrade.
Project description
yuragi — Measure how unstable your LLM's confidence really is
Instant Demo
No API key needed:
pip install yuragi
yuragi demo
What It Does
yuragi measures confidence fragility: how much a model's certainty shifts when you rephrase the same question. It generates 13 perturbation variants of your prompt (typos, tone changes, paraphrases, authority framing), calls your model, and compares the confidence across responses. When the answer text stays the same but confidence moves, that's fragility — a property of the prompt wording, not the model's knowledge.
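For intuition, here is a minimal sketch of that measurement loop. It illustrates the idea only, not yuragi's internals: perturb() and ask() are hypothetical stand-ins for the real perturbation generator and model call.

```python
# Sketch of the fragility idea only; perturb() and ask() are hypothetical
# stand-ins, and yuragi's actual scoring is more involved.
from statistics import pstdev
from typing import Callable, Iterable

def fragility_sketch(prompt: str,
                     perturb: Callable[[str], Iterable[str]],
                     ask: Callable[[str], tuple[str, float]]) -> float:
    """Spread of confidence across rephrasings whose answer text is unchanged."""
    base_answer, base_conf = ask(prompt)
    confs = [base_conf]
    for variant in perturb(prompt):      # typos, tone changes, paraphrases, ...
        answer, conf = ask(variant)
        if answer == base_answer:        # same answer, moving confidence: fragility
            confs.append(conf)
    return pstdev(confs) if len(confs) > 1 else 0.0
```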
v0.5.0 also ships yuragi.guardrails — a confidence-aware multi-agent runtime with append-only audit logging, Git-like state snapshots, and AutoGen / LangGraph integrations. See the Guardrails section below.
Install
pip install yuragi
Optional extras:
pip install yuragi[viz] # heatmap / reliability diagram output
pip install yuragi[semantic] # sentence-transformers for semantic entropy
pip install yuragi[stats] # numpy/scipy for statistical tests
pip install yuragi[guardrails] # confidence-aware LLM guardrails (stdlib only)
pip install yuragi[guardrails-autogen] # AutoGen integration
pip install yuragi[guardrails-langgraph] # LangGraph integration
pip install yuragi[guardrails-nats] # NATS JetStream distributed transport
pip install yuragi[all] # everything
Supports any litellm-compatible provider — OpenAI, Anthropic, Google, local Ollama, and 100+ others.
Python API
from yuragi import Scanner
result = Scanner(model="cerebras/llama-3.1-8b-instruct").scan("Is quantum computing practical?")
print(result.fragility_score) # 0.056
print(result.dissociation_rate) # 0.07 — answer same, confidence shifted
Psychology experiments / Trilayer / Semantic Entropy API
from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment
result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta) # average confidence change
print(result.effect_confirmed) # True if max_delta >= 0.15
from yuragi.analysis.trilayer import measure_trilayer
result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence) # Layer 1: token probability
print(result.sampling_confidence) # Layer 2: behavioral consistency
print(result.verbalized_confidence) # Layer 3: self-reported
print(result.internal_conflict) # True if discrepancy > 0.2
from yuragi.metrics.semantic_entropy import semantic_entropy
h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])
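The idea behind semantic entropy: samples that mean the same thing are clustered, and entropy is computed over cluster mass rather than surface strings, so the three phrasings above collapse into one cluster with entropy ≈ 0. Below is a minimal sketch; the keyword-overlap equivalence test is a crude stand-in for the embedding-based clustering that the [semantic] extra enables.

```python
# Minimal semantic-entropy sketch. equivalent() is a deliberately crude
# stand-in for the sentence-transformers clustering the real extra uses.
import math

def equivalent(a: str, b: str) -> bool:
    words_a = {w.strip(".,!?'").lower() for w in a.split()}
    words_b = {w.strip(".,!?'").lower() for w in b.split()}
    return bool(words_a & words_b)  # hypothetical: any shared word counts

def semantic_entropy_sketch(samples: list[str]) -> float:
    clusters: list[list[str]] = []
    for s in samples:                    # greedy clustering by equivalence
        for c in clusters:
            if equivalent(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    # entropy over cluster mass; max() clamps the -0.0 edge case
    return max(0.0, -sum(len(c) / n * math.log(len(c) / n) for c in clusters))

print(semantic_entropy_sketch(["Paris", "It's Paris.", "The capital is Paris"]))  # 0.0
```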
CLI Quickstart
# Scan a prompt for fragility
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct
# Find the single weakest word
yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2
# Run a psychology stress test
yuragi experiment asch --model ollama/llama3.2
Use Cases
CI/CD regression detection — catch fragility regressions before they reach production:
yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
Fragility-aware routing — route each prompt to the model that answers most stably:
yuragi route "What causes inflation?" --models gpt-4o-mini,ollama/llama3.2,cerebras/llama-3.1-8b
Abstention guard — refuse to answer when fragility exceeds the domain threshold (0.03 for medical, 0.02 for safety; a Python equivalent is sketched after this list):
yuragi guard "What medication should I take?" --domain medical --model gpt-4o-mini
Model selection — find the best model for your use case by fragility profile:
yuragi recommend --use-case factual --models gpt-4o-mini,ollama/llama3.2 --budget medium
Automated red teaming — discover model weaknesses across all 13 perturbation types:
yuragi red-team prompts.txt --model gpt-4o-mini --output report.json
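The abstention logic is also easy to replicate in Python with the public Scanner API. In this sketch the thresholds mirror the domain defaults quoted above, but guarded_answer() itself is our helper, not a yuragi function:

```python
# Illustrative abstention wrapper over the public Scanner API.
# THRESHOLDS mirror the documented domain defaults; guarded_answer()
# is a hypothetical helper, not part of yuragi.
from yuragi import Scanner

THRESHOLDS = {"medical": 0.03, "safety": 0.02, "default": 0.05}  # "default" is our choice

def guarded_answer(prompt: str, domain: str = "default", model: str = "gpt-4o-mini"):
    result = Scanner(model=model).scan(prompt)
    if result.fragility_score >= THRESHOLDS.get(domain, THRESHOLDS["default"]):
        return None  # abstain: the answer is too sensitive to rephrasing
    return result

print(guarded_answer("What medication should I take?", domain="medical"))
```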
Research Results
Real-data empirical results on llama-3.1-8B-Instruct (Cerebras + NVIDIA NIM endpoints, April 2026):
🎯 Primary finding: Ensemble hallucination detection on TruthfulQA
| Metric | Value |
|---|---|
| Dataset | TruthfulQA, n=412 LLM-judge-labeled questions |
| Method | LogReg over 105 engineered features (13 fragility + interactions + inversions) |
| AUC-ROC | 0.7304 |
| 95% CI (5-fold CV) | [0.6776, 0.7792] |
| Brier score | 0.219 (calibrated) |
On its own, fragility_score scores AUC ≈ 0.50 across 6 datasets (TruthfulQA, TriviaQA, NQ-Open, NIM, Cohere, Mistral); the earlier "0.62 noise floor" claim is retracted. The ensemble AUC of 0.73 is driven mainly by baseline_confidence and verbal_logprob_gap, not by the perturbations: adding the 10 perturbation features over a no-perturbation baseline yields Δ=−0.027, 95% CI [−0.085, +0.035], p=0.35 on TruthfulQA (n=412, llm_label), which is not statistically significant. On is_correct (flexible-judge) labels, perturbations actually reduce AUC significantly (Cerebras n=382: Δ=−0.05, p=0.012; cross-family pooled n=300: Δ=−0.07, p=0.033). Sources: experiments/ensemble_final.txt, experiments/ablation_delta_significance_report.txt
🔄 Secondary finding: Confidence sign-inversion on 8B
| Dataset | Raw baseline_confidence AUC | Inverted AUC |
|---|---|---|
| TruthfulQA (n=412) | 0.407 | 0.593 |
| TriviaQA (n=200, pilot) | 0.252 | 0.748 [0.67, 0.82] |
| Multi-judge majority (n=200) | 0.365 | 0.635 |
On llama-3.1-8B, higher self-reported confidence correlates with higher hallucination probability, the opposite of the sign assumed by temperature scaling, abstention thresholds, and RLHF calibration objectives.
Scope: single model (llama-3.1-8B), multi-provider (Cerebras + NVIDIA NIM + Cohere + Mistral pilots at n=100 each, all non-significant individually). Cross-family replication at n≥400 per model is still outstanding; treat this as a hypothesis with convergent evidence rather than a validated claim. See paper/revolutionary_reframe.md and experiments/ablation_05_cross_model_report.txt.
🗺️ Domain boundary
Fragility is not a universal hallucination detector. On 413 TruthfulQA questions categorised by axis:
| Axis | Example | yuragi AUC |
|---|---|---|
| Single-path factoids (obscure trivia) | "Who discovered argon?" | ~0.75 (works) |
| Imitative falsehoods (well-known misconceptions) | "What happens if you break a mirror?" | ~0.50 (fails) |
Fragility measures uncertainty, not incorrectness. When a model is confidently-wrong from training-data imitation (TruthfulQA's fiction/myth axis, 40% of the benchmark), perturbations do not shake that confidence. See paper/domain_boundary_section.md.
⚠️ Reliability audit
Test–retest Pearson correlation on paired scans (same prompt, different seed):
| Signal | r | Recommendation |
|---|---|---|
| baseline_confidence | 0.88 | ✓ Primary |
| paraphrase_fragility | 0.80 | ✓ Primary |
| adaptive_fragility | 0.78 | ✓ Primary |
| impostor_fragility | 0.70 | ○ Supporting |
| fragility_score (aggregate) | 0.64 | ○ Supporting |
| counterfactual_fragility | 0.18 | ✗ Noise-dominated, do not use |
📝 Supporting findings
- Confidence tracks text, not knowledge — When answer text is identical (Jaccard=1.0), max confidence shift is 0.021 (below the noise floor). When text differs, confidence shifts up to 0.528. See RESEARCH.md.
- Fragility scaling trend — Across 5 models (1.2B to 22B active parameters), mean fragility follows F(N) = a/√N + b with R²=0.987; the nonzero asymptote suggests irreducible fragility at scale (see the fit sketch below). See RESEARCH.md.
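Reproducing the fit takes a few lines of scipy once per-model mean fragilities are in hand. The five (N, F) pairs below are placeholders for illustration, not the measured values from RESEARCH.md:

```python
# Fit F(N) = a / sqrt(N) + b to mean fragility vs. active-parameter count.
# The (N, F) pairs are PLACEHOLDERS; substitute the per-model means
# reported in RESEARCH.md.
import numpy as np
from scipy.optimize import curve_fit

N = np.array([1.2e9, 3e9, 8e9, 14e9, 22e9])        # active params (placeholder)
F = np.array([0.080, 0.055, 0.038, 0.031, 0.027])  # mean fragility (placeholder)

def model(n, a, b):
    return a / np.sqrt(n) + b

(a, b), _ = curve_fit(model, N, F)
r2 = 1 - ((F - model(N, a, b)) ** 2).sum() / ((F - F.mean()) ** 2).sum()
print(f"a={a:.3g}  b={b:.3g}  R^2={r2:.3f}")  # nonzero b = irreducible fragility
```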
📉 Honest limitations
- Statistical: only a single robust claim survives Bonferroni correction. Observed AUCs of 0.50–0.55 on individual perturbation types are noise; only the ensemble and the inverted-confidence signal are well-powered.
- Generalisation: One model, one hardware pair, one language. Cross-family, cross-domain, cross-language replication pending.
- Theoretical ceiling: SSP (AUC 0.786) measures the same perturbation at hidden states and outperforms us by ~0.05. LSD (AUC 0.96) uses full activation geometry. Output-level methods (ours) are bounded by I(correct; h_internal).
- 4 limitations audits + 7 meta-audits are committed to experiments/ for honest scope disclosure.
Integration
pandas — score a DataFrame of prompts:
import pandas as pd
from yuragi import Scanner

df = pd.DataFrame({"prompt": ["Is quantum computing practical?", "What causes inflation?"]})
scanner = Scanner(model="gpt-4o-mini")
df["fragility"] = df["prompt"].apply(lambda p: scanner.scan(p).fragility_score)
pytest — assert stability in tests:
from yuragi import Scanner

def test_prompt_stability():
    result = Scanner(model="gpt-4o-mini").scan("What is the capital of France?")
    assert result.fragility_score < 0.05, f"Fragility too high: {result.fragility_score}"
GitHub Actions — CI/CD fragility gate:
- name: Check fragility regression
  run: yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
A reusable GitHub Actions workflow is included.
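A minimal standalone workflow might look like the sketch below; the trigger, job layout, and Python version are illustrative, and the secret name assumes an OpenAI-backed model:

```yaml
# .github/workflows/fragility-gate.yml (illustrative sketch; adapt
# paths, model, and secret name to your setup)
name: fragility-gate
on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install yuragi
      - name: Check fragility regression
        run: yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```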
Guardrails (v0.5.0)
yuragi.guardrails is an opt-in subpackage that turns yuragi from a measurement library into a confidence-aware LLM guardrail platform. It is shipped inside the yuragi wheel — no extra install needed for the core — and adds zero runtime dependencies (only the standard library).
from yuragi.guardrails import (
AuditLog,
ConfidencePolicy,
ConfidenceReport,
Runtime,
PlannerAgent, ExecutorAgent, CriticAgent,
ResearcherAgent, VerifierAgent,
)
import asyncio

# 1. Append-only audit log with SHA-256 hash chain
log = AuditLog("./audit.db")

async def main():
    # 2. A multi-agent mesh with confidence-aware routing
    async with Runtime(audit_log=log) as rt:
        await rt.spawn(PlannerAgent, name="planner")
        await rt.spawn(ExecutorAgent, name="executor")
        await rt.spawn(CriticAgent, name="critic", policy=ConfidencePolicy(tau=0.85))
        await rt.spawn(ResearcherAgent, name="researcher")
        await rt.spawn(VerifierAgent, name="verifier")
        await rt.publish("planner", {"task": "summarise quantum tunneling", "complexity": 6})

    # 3. Verify nobody tampered with the audit trail later
    assert await log.verify_chain()

asyncio.run(main())
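For intuition about the tamper-evidence claim: each record's hash covers the previous record's hash, so editing any entry invalidates every later link. Here is a minimal sketch of that scheme; the field names and serialization are ours, not yuragi.guardrails' actual on-disk format.

```python
# Minimal append-only SHA-256 hash chain. Field names and serialization
# are illustrative, not yuragi.guardrails' actual schema.
import hashlib
import json

GENESIS = "0" * 64

def record_hash(prev: str, event: dict) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

def append(chain: list[dict], event: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "event": event, "hash": record_hash(prev, event)})

def verify(chain: list[dict]) -> bool:
    prev = GENESIS
    for rec in chain:
        if rec["prev"] != prev or rec["hash"] != record_hash(prev, rec["event"]):
            return False
        prev = rec["hash"]
    return True

log: list[dict] = []
append(log, {"agent": "planner", "msg": "plan created"})
append(log, {"agent": "critic", "msg": "approved", "confidence": 0.91})
assert verify(log)

log[0]["event"]["msg"] = "tampered"   # any edit breaks every later hash link
assert not verify(log)
```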
Differentiators against existing OSS guardrails:
| Feature | NeMo Guardrails | Guardrails AI | Llama Guard | LangKit | yuragi.guardrails |
|---|---|---|---|---|---|
| Confidence-aware routing | – | – | – | – | fused 4-signal score |
| Tamper-evident audit log | – | – | – | – | SHA-256 hash chain |
| Crash-resume snapshots | – | – | – | – | Merkle DAG, ≤ 1 s target |
| Public benchmarks | – | – | partial | – | TruthfulQA / TriviaQA AUC 0.73–0.75 |
Framework integrations (each behind an extras gate so the core stays light):
from yuragi.guardrails.integrations.autogen import AutoGenGuardrail # pip install yuragi[guardrails-autogen]
from yuragi.guardrails.integrations.langgraph import guardrail_node # pip install yuragi[guardrails-langgraph]
The runtime ships with InMemoryTransport by default; for distributed deployments install yuragi[guardrails-nats] and pass NatsTransport(...) instead. NATS support is experimental in v0.5.0 — see KNOWN_LIMITATIONS.md (G1–G3) before relying on it in production.
A complete demo lives at examples/guardrails_smoke.py.
Full CLI Reference
All 19 commands
| Command | Description |
|---|---|
| demo | Run pre-computed demo (no API key needed) |
| scan | Full fragility scan (13 perturbation types) |
| find-weakness | Find the single word that most collapses confidence |
| experiment | Run a psychology template (11 types) |
| compare-models | Multi-model fragility comparison with heatmap |
| check | CI/CD fragility regression detection |
| route | Fragility-aware multi-model routing |
| guard | Abstention system for high-stakes domains |
| recommend | Model selection based on fragility profiles |
| red-team | Automated vulnerability discovery |
| trajectory | Track confidence across a prompt sequence |
| stats | Statistical analysis (Cohen's d, Wilcoxon, bootstrap CI) |
| trilayer | Measure confidence via 3 simultaneous methods |
| profile | Fragility profile: CCI / RE / NLS |
| linguistic | Analyze linguistic confidence markers (hedges, assertiveness) |
| volatility | Financial-engineering metrics (VIX, Sharpe ratio) for confidence |
| phase-map | Map confidence phase transitions across parameter space |
| compare | Compare two scan results (A/B test) |
| export | Export scan results to CSV/JSON |
Research
Key discoveries, empirical data, and scaling trends: RESEARCH.md
White-box layer entropy experiments:
python experiments/whitebox_design.py --exp entropy_trajectory
python experiments/whitebox_design.py --exp critical_layer_heatmap
python experiments/whitebox_design.py --exp cpu # lightweight demo
See also docs/related_work.md for comparison with lm-polygraph, SelfCheckGPT, PromptBench, CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS.
Papers
ICML 2026 MI Workshop (submission target 2026-05-07):
"From Black-Box Fragility to White-Box Dynamics: Layer-Resolved Entropy Signatures of Confidence Perturbation in LLMs"
Source: paper/icml2026_mi/. Three contributions: (1) conditional two-phase entropy signature with L12–L13 causal necessity via activation patching, (2) perturbation-type-specific layer pathways, (3) confidence stability scaling trend F(N) = a/√N + b.
EMNLP ARR 2026 (in preparation, target 2026-05-25):
"Intent-Misalignment Hallucination: Perturbation-Driven Detection of Specification-Ignored LLM Generation"
Outline: paper/emnlp2026_intent/OUTLINE.md. Introduces intent-misalignment hallucination — outputs that are syntactically correct and instruction-compliant yet ignore per-user project context — and proposes context-stripping perturbation (CSP) for detection. Seed dataset (30 tasks × 3 ecosystems) at seed_tasks.jsonl.
Citation
@misc{yuragi2025,
title = {yuragi: Confidence Fragility in Neural Networks},
author = {hinanohart},
year = {2026},
url = {https://github.com/hinanohart/yuragi}
}
Contributing / License
Issues and PRs welcome. See CONTRIBUTING.md.
Known limitations: KNOWN_LIMITATIONS.md. Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.
Apache License 2.0 for human use.
AI / ML training opt-out
This repository is opted out of AI/ML training, fine-tuning, evaluation, and embedding generation. See ai.txt. Using this work to train machine-learning models without separately negotiated written permission is explicitly disallowed. The Apache License 2.0 covers human use and software redistribution; it does not grant a training data license.
Download files
File details
Details for the file yuragi-0.5.0.tar.gz.
File metadata
- Download URL: yuragi-0.5.0.tar.gz
- Size: 4.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 66a1e59e33fa89f85f1cc84c12e598cbacc82e9dbee98e24284f0a5b33d44319 |
| MD5 | 3a05a3d81ee42cf7ac045071de720fc0 |
| BLAKE2b-256 | 6a977bf3504636b79aa7be59dfe4faa3c4096cb6243318421dc4d66c810dd3a0 |
Provenance
The following attestation bundles were made for yuragi-0.5.0.tar.gz:
Publisher: release.yml on hinanohart/yuragi
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yuragi-0.5.0.tar.gz
- Subject digest: 66a1e59e33fa89f85f1cc84c12e598cbacc82e9dbee98e24284f0a5b33d44319
- Sigstore transparency entry: 1328575633
- Permalink: hinanohart/yuragi@96d5d771d0ec76fe9f031f0f8bb8fd139ec08772
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/hinanohart
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@96d5d771d0ec76fe9f031f0f8bb8fd139ec08772
- Trigger Event: push
File details
Details for the file yuragi-0.5.0-py3-none-any.whl.
File metadata
- Download URL: yuragi-0.5.0-py3-none-any.whl
- Size: 255.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b8063143e9d2f0e75afd33d4a2fdad1b657220cedfeee61ee807d27b89d21e73 |
| MD5 | 878e86dabfeb9d4ace531007acba6116 |
| BLAKE2b-256 | 05e7345dae0b897be7a924d82bfd1aa6776bdb6064f4ff91507892f8f224fb2e |
|
Provenance
The following attestation bundles were made for yuragi-0.5.0-py3-none-any.whl:
Publisher: release.yml on hinanohart/yuragi
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yuragi-0.5.0-py3-none-any.whl
- Subject digest: b8063143e9d2f0e75afd33d4a2fdad1b657220cedfeee61ee807d27b89d21e73
- Sigstore transparency entry: 1328575641
- Permalink: hinanohart/yuragi@96d5d771d0ec76fe9f031f0f8bb8fd139ec08772
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/hinanohart
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@96d5d771d0ec76fe9f031f0f8bb8fd139ec08772
- Trigger Event: push