
LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is

Reason this release was yanked:

0.5.1 includes security fixes; earlier sdists leaked research artifacts and local paths. Please upgrade.

Project description

yuragi (揺らぎ) — Confidence Fragility in Neural Networks

日本語 (Japanese README)

"AI confidence is a property of the text, not the knowledge."

License: MIT · Python 3.11+

TL;DR

yuragi measures how fragile your AI's confidence really is — and reveals what happens inside the neural network when confidence breaks.

pip install yuragi
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct

(Screenshot: yuragi demo output)


Key Discoveries (v0.4.0)

1. Two-Phase Processing

Perturbation propagates through transformer layers in two phases: recognition (entropy decreases as layers identify familiar patterns) then disruption (entropy increases as the perturbation destabilizes the representation). Observed in 30.6% of prompt pairs (n=49) in Pythia-410M white-box experiments (mean critical layer: 16.65, stdev: 3.98). See docs/bench/real/whitebox_n50_pythia410m.json.
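This trajectory shape can be detected mechanically. A minimal sketch (illustrative only, not the yuragi API; the margin threshold is an assumption) that locates the critical layer and applies a crude two-phase check to a per-layer entropy trajectory:

```python
def critical_layer(entropies):
    """Index where entropy stops falling (recognition) and starts
    rising (disruption): the trajectory's minimum."""
    return min(range(len(entropies)), key=lambda i: entropies[i])

def is_two_phase(entropies, margin=0.05):
    """Heuristic: two-phase if entropy drops by at least `margin`
    before the critical layer and rises by at least `margin` after."""
    k = critical_layer(entropies)
    drop = entropies[0] - entropies[k]
    rise = entropies[-1] - entropies[k]
    return 0 < k < len(entropies) - 1 and drop >= margin and rise >= margin
```

A monotonically decreasing trajectory (no disruption phase) is correctly rejected, since its minimum falls on the final layer.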

2. Phase Transitions in Generation

Adding authority prefixes to prompts causes sudden confidence regime shifts at critical thresholds. In Cerebras LLaMA 3.1-8B on a factual prompt, confidence drops from 0.998 ("Please") to 0.893 ("As a professional expert") to 0.730 ("From a scholarly perspective") to 0.528 ("In your capacity as the foremost") — a non-linear collapse. See docs/bench/real/phase_transition_cerebras_8b.json.
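The critical threshold can be read off the confidence sequence as its largest single step down. A small sketch (illustrative, not the yuragi API) applied to the numbers above:

```python
def largest_drop(confidences):
    """Return (index, size) of the largest single drop between
    consecutive confidence measurements."""
    drops = [confidences[i] - confidences[i + 1] for i in range(len(confidences) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return i, drops[i]

# Values from the Cerebras LLaMA 3.1-8B example above:
idx, drop = largest_drop([0.998, 0.893, 0.730, 0.528])
# idx == 2: the step to "In your capacity as the foremost" is the steepest
```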

3. Confidence Stability Scaling Law

Empirical data across 5 models (1.2B → 22B active parameters, logprob mode, seed=42):

| Model | Active params (B) | Mean fragility |
|---|---|---|
| LFM 2.5 | 1.2 | 0.067 |
| Llama 3.2 | 3.0 | 0.047 |
| Gemma 4 (e4b) | 4.5 | 0.039 |
| Llama 3.1 (Cerebras) | 8.0 | 0.037 |
| Qwen 3 235B-A22B | 22.0 | 0.025 |

Power law fit (R²=0.986):

F(N) = 0.058 · N^(-0.50) + 0.014

Fragility falls with scale roughly 4.5× faster than loss does (Kaplan loss exponent 0.076 vs. fragility exponent 0.50). Halving fragility requires only ~7.6× more parameters. See docs/theory.md Section 15.
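The fit can be sanity-checked numerically. A minimal sketch (illustrative arithmetic, not yuragi code) evaluating the reported power law and solving for the parameter count that halves fragility from the 1.2B starting point:

```python
def fragility(n_billion, a=0.058, alpha=0.50, c=0.014):
    """Reported fit: F(N) = a * N**(-alpha) + c, with N in billions of active params."""
    return a * n_billion ** -alpha + c

f0 = fragility(1.2)              # ~0.067, matching the LFM 2.5 row above
target = f0 / 2 - 0.014          # residual power-law term at half fragility
n_half = (0.058 / target) ** 2   # ~8.9B active params, close to the ~7.6x figure
```

Note that because of the constant floor 0.014, the parameter multiplier needed to halve fragility grows as the starting model gets larger.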

4. The Confidence-Text Coupling

When the answer text is identical (Jaccard sim=1.0), max confidence shift is 0.021, mean is 0.007. All 21 perfect-match cases fall below the noise floor (τ=0.06). Confidence tracks text change, not knowledge uncertainty. This finding motivates using fragility as a hallucination signal: simulated TruthfulQA experiment (n=100) yields AUC=0.738, Cohen's d=0.918.
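A minimal sketch of the underlying check (illustrative only; the helper names here are not yuragi's API): token-set Jaccard similarity between two answers, plus the noise-floor test on the confidence shift.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two answer strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def dissociated(answer_a, answer_b, conf_a, conf_b, tau=0.06):
    """Answer text unchanged, yet confidence moved past the noise floor tau."""
    return jaccard(answer_a, answer_b) == 1.0 and abs(conf_a - conf_b) > tau
```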


Quick Start

pip install yuragi

# Scan a prompt for fragility
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct

# Find the single weakest word
yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2

# Run a psychology experiment
yuragi experiment asch --model ollama/llama3.2

# Compare models side-by-side
yuragi compare-models --models qwen3:4b,phi4-mini,gemma3:4b,llama3.2:3b \
  -p "Is quantum computing practical?" --heatmap heatmap.png

How yuragi Differs

| Capability | yuragi | lm-polygraph | SelfCheckGPT | PromptBench |
|---|---|---|---|---|
| Confidence fragility measurement | Yes | – | – | – |
| Confidence dissociation (answer same, confidence shifts) | Yes | – | – | – |
| Black-box (any LLM API) | Yes | Partial | Yes | Partial |
| CLI-first (2 commands to result) | Yes | – | – | – |
| Psychology stress tests | 11 | – | – | – |
| Trilayer analysis (logprob + sampling + verbalized) | Yes | Individual | – | – |
| White-box layer entropy experiments | Yes | – | – | – |
| Production applications (CI/CD, routing, guard) | 5 | – | – | – |
| Core dependencies | 3 | torch+transformers | torch+transformers | torch+transformers |

See docs/related_work.md for full comparison with CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS.


Production Use Cases

yuragi is not just a research tool. Five production-ready applications ship as both a Python API and CLI commands:

1. CI/CD Fragility Regression (yuragi check)

Catch confidence fragility regressions before they reach production. Runs in GitHub Actions, GitLab CI, or any CI pipeline.

# First run: establish baseline
yuragi check prompts.txt --save-baseline baseline.json --model gpt-4o-mini

# Every PR: detect regressions (exit code 1 = fail)
yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini

from yuragi.applications.check import FragilityChecker

checker = FragilityChecker(model="gpt-4o-mini", threshold_abs=0.05)
result = checker.run(prompts=["Is quantum computing practical?"], baseline_path="baseline.json")
if result.failed:
    print(result.summary)  # "FAIL: 2 regressions detected"

A reusable GitHub Actions workflow is included.

2. Fragility-Aware Routing (yuragi route)

Route each prompt to the model that answers it most stably. In multi-model architectures, the model with the lowest fragility on a given prompt tends to produce the most reliable answer.

yuragi route "What causes inflation?" --models gpt-4o-mini,ollama/llama3.2,cerebras/llama-3.1-8b

from yuragi.applications.route import FragilityRouter

router = FragilityRouter(models=["gpt-4o-mini", "ollama/llama3.2", "cerebras/llama-3.1-8b"])
result = router.route("What causes inflation?")
print(result.selected_model)     # model with lowest fragility
print(result.routing_confidence) # how sure we are about the routing

3. Abstention System (yuragi guard)

For medical, legal, and financial domains: refuse to answer when fragility exceeds safety thresholds. Domain presets auto-calibrate (medical: fragility < 0.03, safety: < 0.02).

yuragi guard "What medication should I take for headaches?" --domain medical --model gpt-4o-mini

from yuragi.applications.guard import FragilityGuard

guard = FragilityGuard(model="gpt-4o-mini", domain="medical")
decision = guard.evaluate("What medication should I take?")
if decision.should_abstain:
    print(decision.safe_response)  # "I'm not confident enough... Please consult a professional."

4. Model Selection Guide (yuragi recommend)

Find the best model for your use case. Fragility profiles vary by category: factual tasks need large, stable models; creative tasks tolerate smaller ones.

yuragi recommend --use-case factual --models gpt-4o-mini,ollama/llama3.2 --budget medium

5. Automated Red Teaming (yuragi red-team)

Discover model weaknesses by probing all 13 perturbation types. Produces a vulnerability report ranked by severity.

yuragi red-team prompts.txt --model gpt-4o-mini --output report.json

from yuragi.applications.red_team import FragilityRedTeam

red = FragilityRedTeam(model="gpt-4o-mini")
report = red.probe(prompts=["What is 2+2?", "Explain gravity."])
print(report.summary)  # "Weakest to: tone (severity 0.042), Strongest against: synonym (0.003)"

Research Applications

  • Hallucination prediction: High fragility on factual queries predicts hallucination risk. Simulated AUC=0.738, Cohen's d=0.918 on TruthfulQA regime classification. See benchmarks/hallucination_experiment.py.
  • Scaling law: Fragility scales as F(N) = 0.058*N^(-0.50) + 0.014, 4.5x faster than loss scaling (Kaplan et al.).
  • White-box layer entropy: a two-phase processing signature (recognition, then disruption) found in Pythia-410M, paralleling ERP components in neuroscience.
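For reference, the AUC figure quoted above can be computed by simple pairwise ranking over fragility scores. A sketch with hypothetical scores (not the TruthfulQA data, and not yuragi's implementation):

```python
def auc(pos_scores, neg_scores):
    """Probability that a hallucinated (positive) case has higher fragility
    than a faithful (negative) case; ties count as half a win."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical fragility scores for hallucinated vs. faithful answers:
score = auc([0.08, 0.06, 0.05], [0.02, 0.03, 0.05])
```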

White-Box Experiments

Run layer-entropy experiments on open-weight models (requires Hugging Face model access; CPU-compatible):

python experiments/whitebox_design.py --exp entropy_trajectory
python experiments/whitebox_design.py --exp critical_layer_heatmap
python experiments/whitebox_design.py --exp cpu  # lightweight demo

Results: docs/bench/real/whitebox_n50_pythia410m.json


Python API

from yuragi import Scanner

result = Scanner(model="cerebras/llama-3.1-8b-instruct").scan("Is quantum computing practical?")
print(result.fragility_score)    # 0.056 (logprob mode)
print(result.dissociation_rate)  # 0.07

for r in result.perturbation_results:
    if r.is_dissociated:  # answer same, confidence shifted
        print(f"{r.perturbation_type}: severity {r.dissociation_severity:.2f}")

Psychology experiments / Trilayer / Semantic Entropy API

from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment

result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta)        # average confidence change
print(result.effect_confirmed) # True if max_delta >= 0.15

from yuragi.analysis.trilayer import measure_trilayer

result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence)     # Layer 1: token probability
print(result.sampling_confidence)    # Layer 2: behavioral consistency
print(result.verbalized_confidence)  # Layer 3: self-reported
print(result.internal_conflict)      # True if discrepancy > 0.2

# Semantic Entropy (Farquhar-inspired proxy; NLI clustering not implemented)
from yuragi.metrics.semantic_entropy import semantic_entropy
h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])
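For illustration, a minimal Jaccard-clustering version of that proxy (a sketch in the spirit of the fallback noted in the comment above; the threshold is an assumption, and this is not yuragi's implementation):

```python
import math

def _jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def semantic_entropy_proxy(samples, threshold=0.3):
    """Greedily cluster samples by Jaccard similarity to each cluster's
    first member, then return the Shannon entropy (nats) of cluster sizes."""
    clusters = []
    for s in samples:
        for c in clusters:
            if _jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)
```

Token-level Jaccard is a crude equivalence test (it treats "Paris" and "It's Paris." as different), which is exactly why the real proxy falls back to embeddings when the `semantic` extra is installed.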

CLI Reference

| Command | Description |
|---|---|
| `scan` | Full fragility scan (13 perturbation types) |
| `find-weakness` | Find the single word that most collapses confidence |
| `experiment` | Run a psychology template (11 types) |
| `compare-models` | Multi-model fragility comparison with heatmap |
| `check` | CI/CD fragility regression detection |
| `route` | Fragility-aware multi-model routing |
| `guard` | Abstention system for high-stakes domains |
| `recommend` | Model selection based on fragility profiles |
| `red-team` | Automated vulnerability discovery |
| `trajectory` | Track confidence across a prompt sequence |
| `stats` | Statistical analysis (Cohen's d, Wilcoxon, bootstrap CI) |
| `trilayer` | Measure confidence via three simultaneous methods |
| `profile` | Fragility profile: CCI / RE / NLS |
| `linguistic` | Analyze linguistic confidence markers (hedges, assertiveness) |
| `volatility` | Financial-engineering metrics (VIX, Sharpe ratio) for confidence |
| `phase-map` | Map confidence phase transitions across parameter space |
| `compare` | Compare two scan results (A/B test) |
| `export` | Export scan results to CSV/JSON |
| `demo` | Run pre-computed demo (no API key needed) |
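As a reference point for the effect sizes quoted in this README (e.g. Cohen's d = 0.918), the standard pooled-variance Cohen's d can be sketched as follows (illustrative only, not yuragi's `stats` implementation):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (Bessel-corrected variances)."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b))
              / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled
```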

Install

pip install yuragi

Optional extras:

pip install yuragi[viz]        # heatmap / reliability diagram output
pip install yuragi[semantic]   # sentence-transformers for semantic entropy
pip install yuragi[stats]      # numpy/scipy for statistical tests
pip install yuragi[all]        # everything

Supported models: any litellm-compatible provider — OpenAI, Anthropic, Google, local Ollama, and 100+ others.


Paper

ICML 2026 MI Workshop submission:

"From Black-Box Fragility to White-Box Dynamics: Layer-Resolved Entropy Signatures of Confidence Perturbation in LLMs"

Source: paper/icml2026_mi/. Three contributions: (1) two-phase entropy signature, (2) perturbation-type-specific layer pathways, (3) confidence stability scaling law F(N)=0.058·N^{-0.50}+0.014.


Known Limitations

See KNOWN_LIMITATIONS.md for full details, including:

  • Multi-model structural reproduction (v0.4.0 reproduces across Cerebras 8B, llama3.2 3B, and Claude 4.6 verbalized; see docs/bench/real/)
  • Psychology experiment fidelity (templates are inspired by cited papers, not full behavioural replications)
  • Semantic Entropy uses Jaccard/cosine fallback, not NLI clustering (Farquhar et al. 2024)
  • Hallucination AUC=0.738 is a simulated result; real TruthfulQA run pending

Contributing

Issues and PRs welcome. See CONTRIBUTING.md.

Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.

Citation

@software{yuragi2025,
  title  = {yuragi: Confidence Fragility in Neural Networks},
  author = {hinanohart},
  year   = {2025},
  url    = {https://github.com/hinanohart/yuragi}
}

License

MIT



Download files

Download the file for your platform.

Source Distribution

yuragi-0.4.0.tar.gz (2.3 MB)


Built Distribution


yuragi-0.4.0-py3-none-any.whl (198.1 kB)


File details

Details for the file yuragi-0.4.0.tar.gz.

File metadata

  • Download URL: yuragi-0.4.0.tar.gz
  • Upload date:
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for yuragi-0.4.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 598de1b5d81bdcea94e4596318e12617fb1687124f0e3347ca7274bec86f7942 |
| MD5 | 21c0103376d924c0403b33df0854b114 |
| BLAKE2b-256 | def1fbac000f260335ae28baed5987ca18249c3ce8b704849f4fb5102b3b4ef3 |


Provenance

The following attestation bundles were made for yuragi-0.4.0.tar.gz:

Publisher: release.yml on hinanohart/yuragi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file yuragi-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: yuragi-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 198.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for yuragi-0.4.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6d574534b8ab2057b133b767c39228848c2f0c944b0a778ddd7ff6ecef87925d |
| MD5 | a1b3d2ea2fdd01b558079a510c8e6bec |
| BLAKE2b-256 | d64e22c6b03a3cdc1836948d0e0b9d56e7258fd10cc7c436ebca5a7981889516 |


Provenance

The following attestation bundles were made for yuragi-0.4.0-py3-none-any.whl:

Publisher: release.yml on hinanohart/yuragi

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
