LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is

These details have not been verified by PyPI

Project description

yuragi — Measure how unstable your LLM's confidence really is

Instant Demo

No API key needed:

pip install yuragi
yuragi demo


What It Does

yuragi measures confidence fragility: how much a model's certainty shifts when you rephrase the same question. It generates 13 perturbation variants of your prompt (typos, tone changes, paraphrases, authority framing), calls your model, and compares the confidence across responses. When the answer text stays the same but confidence moves, that's fragility — a property of the prompt wording, not the model's knowledge.
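
As a rough illustration of the comparison described above, the sketch below derives two numbers from a baseline response and its perturbed variants. It is not yuragi's actual scoring formula: the `Response` type, the mean-absolute-shift definition of fragility, and the 0.05 shift threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical container: one model answer plus its confidence in [0, 1]."""
    text: str
    confidence: float


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def fragility(baseline: Response, variants: list[Response]) -> float:
    """Assumed definition: mean absolute confidence shift from the baseline."""
    if not variants:
        raise ValueError("need at least one variant")
    return sum(abs(v.confidence - baseline.confidence) for v in variants) / len(variants)


def dissociation_rate(baseline: Response, variants: list[Response], shift: float = 0.05) -> float:
    """Share of variants whose text matches the baseline (Jaccard == 1.0)
    while confidence moved by more than `shift` (an assumed threshold)."""
    if not variants:
        raise ValueError("need at least one variant")
    hits = [
        v for v in variants
        if jaccard(v.text, baseline.text) == 1.0
        and abs(v.confidence - baseline.confidence) > shift
    ]
    return len(hits) / len(variants)


# Made-up responses for illustration only, not measurements.
baseline = Response("Paris", 0.90)
variants = [
    Response("Paris", 0.90),
    Response("paris", 0.70),   # same words, confidence dropped
    Response("Lyon", 0.50),    # different answer
    Response("Paris", 0.88),   # same words, shift within threshold
]
print(fragility(baseline, variants))
print(dissociation_rate(baseline, variants))
```

Only the second variant counts toward the dissociation rate here: its words match the baseline while its confidence moved by more than the assumed threshold.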

Install

pip install yuragi

Optional extras:

pip install yuragi[viz]        # heatmap / reliability diagram output
pip install yuragi[semantic]   # sentence-transformers for semantic entropy
pip install yuragi[stats]      # numpy/scipy for statistical tests
pip install yuragi[all]        # everything

Supports any litellm-compatible provider — OpenAI, Anthropic, Google, local Ollama, and 100+ others.

Python API

from yuragi import Scanner

result = Scanner(model="cerebras/llama-3.1-8b-instruct").scan("Is quantum computing practical?")
print(result.fragility_score)    # 0.056
print(result.dissociation_rate)  # 0.07 — answer same, confidence shifted

Psychology experiments / Trilayer / Semantic Entropy API

from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment

result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta)        # average confidence change
print(result.effect_confirmed) # True if max_delta >= 0.15

from yuragi.analysis.trilayer import measure_trilayer

result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence)     # Layer 1: token probability
print(result.sampling_confidence)    # Layer 2: behavioral consistency
print(result.verbalized_confidence)  # Layer 3: self-reported
print(result.internal_conflict)      # True if discrepancy > 0.2

from yuragi.metrics.semantic_entropy import semantic_entropy
h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])
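
For readers new to the idea, semantic entropy groups sampled answers that mean the same thing and takes the entropy over those groups rather than over raw strings. The sketch below is a minimal stand-alone version, not yuragi's implementation. `cluster_entropy` and the toy `mentions_same_city` check are hypothetical names; a real equivalence check would more likely use sentence embeddings or an entailment model.

```python
import math
from typing import Callable, Optional


def cluster_entropy(samples: list[str], same_meaning: Callable[[str, str], bool]) -> float:
    """Greedily cluster samples by meaning, then return the Shannon
    entropy (in nats) of the cluster-size distribution."""
    clusters: list[list[str]] = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):  # compare against the cluster's first member
                c.append(s)
                break
        else:
            clusters.append([s])       # no match: start a new cluster
    n = len(samples)
    return sum((len(c) / n) * math.log(n / len(c)) for c in clusters)


def _city(text: str) -> Optional[str]:
    """Toy helper for the example: which known city, if any, the text mentions."""
    for city in ("paris", "lyon"):
        if city in text.lower():
            return city
    return None


def mentions_same_city(a: str, b: str) -> bool:
    """Toy equivalence check, for illustration only."""
    return _city(a) == _city(b)


# All three phrasings land in one cluster, so the entropy is zero.
print(cluster_entropy(["Paris", "It's Paris.", "The capital is Paris"], mentions_same_city))
# Two answers with different meanings give log(2).
print(cluster_entropy(["Paris", "Lyon"], mentions_same_city))
```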

CLI Quickstart

# Scan a prompt for fragility
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct

# Find the single weakest word
yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2

# Run a psychology stress test
yuragi experiment asch --model ollama/llama3.2

Use Cases

CI/CD regression detection — catch fragility regressions before they reach production:

yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
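
Conceptually, a regression check compares freshly measured scores against stored ones. The sketch below assumes the baseline is a flat prompt-to-score mapping and that a rise of more than 0.01 counts as a regression. The real schema of `baseline.json` and yuragi's pass/fail rule are not described here, so both are hypothetical, as is the name `find_regressions`.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return prompts whose fragility rose by more than `tolerance`,
    mapped to the size of the increase. Prompts missing from `current` are skipped."""
    return {
        prompt: current[prompt] - old
        for prompt, old in baseline.items()
        if prompt in current and current[prompt] - old > tolerance
    }


# Placeholder scores for illustration only, not measurements.
baseline = {"What is the capital of France?": 0.020, "What causes inflation?": 0.050}
current = {"What is the capital of France?": 0.025, "What causes inflation?": 0.090}

regressions = find_regressions(baseline, current)
for prompt, delta in regressions.items():
    print(f"REGRESSION {prompt!r}: +{delta:.3f}")
```

In a CI job, a script like this would typically exit with a nonzero status when `regressions` is non-empty, so that the step fails.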

Fragility-aware routing — route each prompt to the model that answers most stably:

yuragi route "What causes inflation?" --models gpt-4o-mini,ollama/llama3.2,cerebras/llama-3.1-8b
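
One simple way to read "answers most stably" is "has the lowest fragility score for this prompt". The sketch below implements that selection rule over scores that have already been measured. The function name, the model names, and the scores are placeholders, and yuragi's own routing may weigh other factors.

```python
def pick_most_stable(scores: dict[str, float]) -> str:
    """Return the model with the lowest fragility score.
    Ties go to whichever model was listed first."""
    if not scores:
        raise ValueError("no models to choose from")
    return min(scores, key=scores.get)


# Placeholder per-model scores for a single prompt, not measurements.
scores = {"model-a": 0.074, "model-b": 0.031, "model-c": 0.056}
print(pick_most_stable(scores))
```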

Abstention guard — refuse to answer unless fragility is below the domain's safety threshold (medical: < 0.03, safety: < 0.02):

yuragi guard "What medication should I take?" --domain medical --model gpt-4o-mini
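
The guard decision reduces to a threshold comparison per domain. The sketch below uses the two thresholds quoted above. Treating a score exactly at the threshold as "abstain", and raising an error for unknown domains, are assumptions made for this example rather than documented yuragi behavior.

```python
# Thresholds quoted in this README; an answer is allowed only below them.
THRESHOLDS = {"medical": 0.03, "safety": 0.02}


def should_abstain(fragility_score: float, domain: str) -> bool:
    """Abstain unless the score is strictly below the domain's threshold."""
    if domain not in THRESHOLDS:
        raise KeyError(f"no threshold configured for domain {domain!r}")
    return fragility_score >= THRESHOLDS[domain]


print(should_abstain(0.056, "medical"))  # score from the Python API example above
print(should_abstain(0.010, "safety"))
```

In practice the score passed in would come from something like `Scanner(model=...).scan(prompt).fragility_score`.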

Model selection — find the best model for your use case by fragility profile:

yuragi recommend --use-case factual --models gpt-4o-mini,ollama/llama3.2 --budget medium

Automated red teaming — discover model weaknesses across all 13 perturbation types:

yuragi red-team prompts.txt --model gpt-4o-mini --output report.json

Why It Matters

Confidence tracks text, not knowledge. When the answer text is identical (Jaccard=1.0), the maximum confidence shift is 0.021, which is below the noise floor. When the text differs, confidence shifts by up to 0.528. See RESEARCH.md.
Fragility-based ensemble detects hallucinations. Real TruthfulQA experiment (llama-3.1-8b, n=412): LogReg ensemble over 105 features achieves AUC=0.73 [0.68, 0.78] with 5-fold CV against LLM-judge labels. Single-signal fragility plateaus at AUC~0.62 (decoder-stochasticity noise floor). See paper/revolutionary_reframe.md.
Sign-inversion observation on 8B. On llama-3.1-8B, higher baseline confidence correlates with higher hallucination probability (TriviaQA n=200, inverted AUC=0.75). If this replicates on Mistral-7B/Qwen2-7B, it would break the positive-correlation assumption of standard uncertainty pipelines. For now it is a single-model, single-provider finding pending cross-model validation. See paper/revolutionary_reframe.md.
Domain boundary. Fragility works on single-path factual retrieval (obscure trivia) but fails on imitative falsehoods (TruthfulQA's fiction axis). See paper/domain_boundary_section.md.
Larger models are more stable, but not linearly. Empirical scaling trend F(N) = a/√N + b (R²=0.987) across 5 models, with a nonzero asymptote suggesting irreducible fragility at scale. See RESEARCH.md.
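
The scaling trend F(N) = a/√N + b is linear in x = 1/√N, so a and b can be estimated with ordinary least squares. The sketch below shows that fit in plain Python. The data points are synthetic, generated from the formula itself with a = 0.2 and b = 0.03 to check that the fit recovers them. They are not the five models from RESEARCH.md, and `fit_inverse_sqrt` is a hypothetical helper.

```python
import math


def fit_inverse_sqrt(sizes: list[float], scores: list[float]) -> tuple[float, float]:
    """Least-squares fit of F(N) = a / sqrt(N) + b. Returns (a, b)."""
    xs = [1 / math.sqrt(n) for n in sizes]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(scores) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    a = sxy / sxx            # slope in 1/sqrt(N)
    b = mean_y - a * mean_x  # asymptote as N grows
    return a, b


# Synthetic check: sizes in arbitrary units, scores generated from the formula.
sizes = [1, 4, 9, 16, 25]
scores = [0.2 / math.sqrt(n) + 0.03 for n in sizes]

a, b = fit_inverse_sqrt(sizes, scores)
print(f"a={a:.3f}, b={b:.3f}")
```

A fitted b above zero corresponds to the nonzero asymptote mentioned above: fragility that does not vanish as N grows.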

Integration

pandas — score a DataFrame of prompts:

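import pandas as pd
from yuragi import Scanner

# Example input: one prompt per row
df = pd.DataFrame({"prompt": ["What is the capital of France?", "Is quantum computing practical?"]})

scanner = Scanner(model="gpt-4o-mini")
# Runs one full scan per row
df["fragility"] = df["prompt"].apply(lambda p: scanner.scan(p).fragility_score)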

pytest — assert stability in tests:

from yuragi import Scanner

def test_prompt_stability():
    result = Scanner(model="gpt-4o-mini").scan("What is the capital of France?")
    assert result.fragility_score < 0.05, f"Fragility too high: {result.fragility_score}"

GitHub Actions — CI/CD fragility gate:

- name: Check fragility regression
  run: yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini

A reusable GitHub Actions workflow is included.

Full CLI Reference

All 19 commands

Command	Description
`demo`	Run pre-computed demo (no API key needed)
`scan`	Full fragility scan (13 perturbation types)
`find-weakness`	Find the single word that most collapses confidence
`experiment`	Run a psychology template (11 types)
`compare-models`	Multi-model fragility comparison with heatmap
`check`	CI/CD fragility regression detection
`route`	Fragility-aware multi-model routing
`guard`	Abstention system for high-stakes domains
`recommend`	Model selection based on fragility profiles
`red-team`	Automated vulnerability discovery
`trajectory`	Track confidence across a prompt sequence
`stats`	Statistical analysis (Cohen's d, Wilcoxon, bootstrap CI)
`trilayer`	Measure confidence via 3 simultaneous methods
`profile`	Fragility profile: CCI / RE / NLS
`linguistic`	Analyze linguistic confidence markers (hedges, assertiveness)
`volatility`	Financial-engineering metrics (VIX, Sharpe ratio) for confidence
`phase-map`	Map confidence phase transitions across parameter space
`compare`	Compare two scan results (A/B test)
`export`	Export scan results to CSV/JSON

Research

Key discoveries, empirical data, and scaling trends: RESEARCH.md

White-box layer entropy experiments:

python experiments/whitebox_design.py --exp entropy_trajectory
python experiments/whitebox_design.py --exp critical_layer_heatmap
python experiments/whitebox_design.py --exp cpu  # lightweight demo

See also docs/related_work.md for comparison with lm-polygraph, SelfCheckGPT, PromptBench, CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS.

Paper

ICML 2026 MI Workshop submission:

"From Black-Box Fragility to White-Box Dynamics: Layer-Resolved Entropy Signatures of Confidence Perturbation in LLMs"

Source: paper/icml2026_mi/. Three contributions: (1) conditional two-phase entropy signature, (2) perturbation-type-specific layer pathways, (3) confidence stability scaling trend F(N) = a/√N + b.

Citation

@software{yuragi2026,
  title  = {yuragi: Confidence Fragility in Neural Networks},
  author = {hinanohart},
  year   = {2026},
  url    = {https://github.com/hinanohart/yuragi}
}

Contributing / License

Issues and PRs welcome. See CONTRIBUTING.md.

Known limitations: KNOWN_LIMITATIONS.md. Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.

MIT

Project details

Release history

0.5.2

Apr 18, 2026

0.5.1

Apr 18, 2026

0.5.0 yanked

Apr 17, 2026