LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is
Reason this release was yanked:
0.5.1 includes security fixes; earlier sdists leaked research artifacts and local paths. Please upgrade.
Project description
yuragi (揺らぎ) — Confidence Fragility in Neural Networks
"AI confidence is a property of the text, not the knowledge."
TL;DR
yuragi measures how fragile your AI's confidence really is — and reveals what happens inside the neural network when confidence breaks.
pip install yuragi
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct
Key Discoveries (v0.4.0)
1. Two-Phase Processing
Perturbation propagates through transformer layers in two phases: recognition (entropy decreases as layers identify familiar patterns) then disruption (entropy increases as the perturbation destabilizes the representation). Observed in 30.6% of prompt pairs (n=49) in Pythia-410M white-box experiments (mean critical layer: 16.65, stdev: 3.98). See docs/bench/real/whitebox_n50_pythia410m.json.
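As a toy illustration of the two-phase signature (synthetic entropy values, not taken from the benchmark), the critical layer is simply the turning point where per-layer entropy stops falling and starts rising:

```python
# Locate the critical layer in a per-layer entropy trajectory:
# recognition phase = entropy falls, disruption phase = entropy rises.
def critical_layer(entropies):
    """Index of the minimum-entropy layer (the phase turning point)."""
    return min(range(len(entropies)), key=lambda i: entropies[i])

# Synthetic 24-layer trajectory for illustration only:
# entropy decreases while the pattern is recognized, then climbs.
traj = [5.2, 4.8, 4.3, 3.9, 3.6, 3.4, 3.3, 3.2, 3.1, 3.0,
        2.9, 2.85, 2.8, 2.78, 2.76, 2.75, 2.74, 2.9, 3.2,
        3.6, 4.0, 4.3, 4.5, 4.6]
print(critical_layer(traj))  # 16 in this synthetic example
```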
2. Phase Transitions in Generation
Adding authority prefixes to prompts causes sudden confidence regime shifts at critical thresholds. In Cerebras LLaMA 3.1-8B on a factual prompt, confidence drops from 0.998 ("Please") to 0.893 ("As a professional expert") to 0.730 ("From a scholarly perspective") to 0.528 ("In your capacity as the foremost") — a non-linear collapse. See docs/bench/real/phase_transition_cerebras_8b.json.
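Using the confidence values quoted above, the successive drops make the non-linearity explicit:

```python
# Confidence per authority prefix, taken from the Cerebras
# LLaMA 3.1-8B example above; deltas show an accelerating collapse.
conf = {
    "Please": 0.998,
    "As a professional expert": 0.893,
    "From a scholarly perspective": 0.730,
    "In your capacity as the foremost": 0.528,
}
values = list(conf.values())
deltas = [round(b - a, 3) for a, b in zip(values, values[1:])]
print(deltas)  # [-0.105, -0.163, -0.202] — each step drops more than the last
```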
3. Confidence Stability Scaling Law
Empirical data across 5 models (1.2B → 22B active parameters, logprob mode, seed=42):
| Model | Active params (B) | Mean fragility |
|---|---|---|
| LFM 2.5 | 1.2 | 0.067 |
| Llama 3.2 | 3.0 | 0.047 |
| Gemma 4 (e4b) | 4.5 | 0.039 |
| Llama 3.1 (Cerebras) | 8.0 | 0.037 |
| Qwen 3 235B-A22B | 22.0 | 0.025 |
Power law fit (R²=0.986):
F(N) = 0.058 · N^(-0.50) + 0.014
Larger models are ~4.5× faster to stabilize than they are to reduce loss (Kaplan exponent 0.076 vs. fragility exponent 0.50). Halving fragility requires only ~7.6× more parameters. See docs/theory.md Section 15.
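The fit can be checked directly against the table: plugging each model's active parameter count into F(N) reproduces the measured fragility to within a few thousandths:

```python
# Evaluate the fitted scaling law F(N) = 0.058 * N^(-0.50) + 0.014
# against the measured mean fragility values from the table above.
def fragility_fit(n_billion):
    return 0.058 * n_billion ** -0.50 + 0.014

measured = {1.2: 0.067, 3.0: 0.047, 4.5: 0.039, 8.0: 0.037, 22.0: 0.025}
for n, f in measured.items():
    print(f"N={n:>4}B  fit={fragility_fit(n):.3f}  measured={f:.3f}")
```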
4. The Confidence-Text Coupling
When the answer text is identical (Jaccard sim=1.0), max confidence shift is 0.021, mean is 0.007. All 21 perfect-match cases fall below the noise floor (τ=0.06). Confidence tracks text change, not knowledge uncertainty. This finding motivates using fragility as a hallucination signal: simulated TruthfulQA experiment (n=100) yields AUC=0.738, Cohen's d=0.918.
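The dissociation test itself is simple to state. A minimal sketch (toy values, not yuragi's internal implementation): a perturbation counts as dissociated when the answer text is unchanged (Jaccard similarity 1.0) but confidence shifts beyond the noise floor τ:

```python
# Dissociation check: identical answer text, but a confidence shift
# beyond the noise floor tau = 0.06 (all values here are illustrative).
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def is_dissociated(ans_a, ans_b, conf_a, conf_b, tau=0.06):
    return jaccard(ans_a, ans_b) == 1.0 and abs(conf_a - conf_b) > tau

print(is_dissociated("Paris is the capital", "Paris is the capital", 0.95, 0.80))  # True
print(is_dissociated("Paris is the capital", "Paris is the capital", 0.95, 0.93))  # False
```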
Quick Start
pip install yuragi
# Scan a prompt for fragility
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct
# Find the single weakest word
yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2
# Run a psychology experiment
yuragi experiment asch --model ollama/llama3.2
# Compare models side-by-side
yuragi compare-models --models qwen3:4b,phi4-mini,gemma3:4b,llama3.2:3b \
-p "Is quantum computing practical?" --heatmap heatmap.png
How yuragi Differs
| Feature | yuragi | lm-polygraph | SelfCheckGPT | PromptBench |
|---|---|---|---|---|
| Confidence fragility measurement | Yes | - | - | - |
| Confidence dissociation (answer same, confidence shifts) | Yes | - | - | - |
| Black-box (any LLM API) | Yes | Partial | Yes | Partial |
| CLI-first (2 commands to result) | Yes | - | - | - |
| Psychology stress tests | 11 | - | - | - |
| Trilayer analysis (logprob+sampling+verbalized) | Yes | Individual | - | - |
| White-box layer entropy experiments | Yes | - | - | - |
| Production applications (CI/CD, routing, guard) | 5 | - | - | - |
| Core dependencies | 3 | torch+transformers | torch+transformers | torch+transformers |
See docs/related_work.md for full comparison with CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS.
Production Use Cases
yuragi is not just a research tool. Five production-ready applications ship as both Python API and CLI commands:
1. CI/CD Fragility Regression (yuragi check)
Catch confidence fragility regressions before they reach production. Runs in GitHub Actions, GitLab CI, or any CI pipeline.
# First run: establish baseline
yuragi check prompts.txt --save-baseline baseline.json --model gpt-4o-mini
# Every PR: detect regressions (exit code 1 = fail)
yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
from yuragi.applications.check import FragilityChecker
checker = FragilityChecker(model="gpt-4o-mini", threshold_abs=0.05)
result = checker.run(prompts=["Is quantum computing practical?"], baseline_path="baseline.json")
if result.failed:
    print(result.summary)  # "FAIL: 2 regressions detected"
A reusable GitHub Actions workflow is included.
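The pass/fail logic can be sketched in a few lines (a hypothetical reconstruction for illustration, not yuragi's actual implementation):

```python
# Compare current fragility scores against a stored baseline and
# flag prompts whose fragility rose by more than threshold_abs.
def find_regressions(baseline, current, threshold_abs=0.05):
    return [
        prompt for prompt, score in current.items()
        if score - baseline.get(prompt, score) > threshold_abs
    ]

baseline = {"Is quantum computing practical?": 0.037}
current = {"Is quantum computing practical?": 0.110}  # fragility jumped
bad = find_regressions(baseline, current)
print(f"FAIL: {len(bad)} regressions detected" if bad else "PASS")
```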
2. Fragility-Aware Routing (yuragi route)
Route each prompt to the model that answers most stably. In multi-model architectures, the model with the lowest fragility produces the most reliable answer.
yuragi route "What causes inflation?" --models gpt-4o-mini,ollama/llama3.2,cerebras/llama-3.1-8b
from yuragi.applications.route import FragilityRouter
router = FragilityRouter(models=["gpt-4o-mini", "ollama/llama3.2", "cerebras/llama-3.1-8b"])
result = router.route("What causes inflation?")
print(result.selected_model) # model with lowest fragility
print(result.routing_confidence) # how sure we are about the routing
3. Abstention System (yuragi guard)
For medical, legal, and financial domains: refuse to answer when fragility exceeds safety thresholds. Domain presets auto-calibrate (medical: fragility < 0.03, safety: < 0.02).
yuragi guard "What medication should I take for headaches?" --domain medical --model gpt-4o-mini
from yuragi.applications.guard import FragilityGuard
guard = FragilityGuard(model="gpt-4o-mini", domain="medical")
decision = guard.evaluate("What medication should I take?")
if decision.should_abstain:
    print(decision.safe_response)  # "I'm not confident enough... Please consult a professional."
4. Model Selection Guide (yuragi recommend)
Find the best model for your use case. Fragility profiles vary by category: factual tasks need large, stable models; creative tasks tolerate smaller ones.
yuragi recommend --use-case factual --models gpt-4o-mini,ollama/llama3.2 --budget medium
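A minimal sketch of category- and budget-aware selection (all profile numbers below are invented for illustration; yuragi's real recommend logic may differ):

```python
# Pick the model with the lowest fragility for a task category,
# restricted to models within a cost budget (illustrative data).
profiles = {
    "gpt-4o-mini":     {"factual": 0.035, "creative": 0.050, "cost": "medium"},
    "ollama/llama3.2": {"factual": 0.047, "creative": 0.041, "cost": "low"},
}
COST_RANK = {"low": 0, "medium": 1, "high": 2}

def recommend(category, budget):
    candidates = {
        m: p for m, p in profiles.items()
        if COST_RANK[p["cost"]] <= COST_RANK[budget]
    }
    return min(candidates, key=lambda m: candidates[m][category])

print(recommend("factual", "medium"))  # gpt-4o-mini (lowest factual fragility)
print(recommend("creative", "low"))    # ollama/llama3.2 (only low-cost option)
```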
5. Automated Red Teaming (yuragi red-team)
Discover model weaknesses by probing all 13 perturbation types. Produces a vulnerability report ranked by severity.
yuragi red-team prompts.txt --model gpt-4o-mini --output report.json
from yuragi.applications.red_team import FragilityRedTeam
red = FragilityRedTeam(model="gpt-4o-mini")
report = red.probe(prompts=["What is 2+2?", "Explain gravity."])
print(report.summary) # "Weakest to: tone (severity 0.042), Strongest against: synonym (0.003)"
Research Applications
- Hallucination prediction: high fragility on factual queries predicts hallucination risk. Simulated AUC=0.738, Cohen's d=0.918 on TruthfulQA regime classification. See benchmarks/hallucination_experiment.py.
- Scaling law: fragility scales as F(N) = 0.058·N^(-0.50) + 0.014, 4.5× faster than loss scaling (Kaplan et al.).
- White-box layer entropy: two-phase processing signature (recognition then disruption) found in Pythia-410M, parallels ERP components in neuroscience.
White-Box Experiments
Run layer-entropy experiments on open-weight models (requires HuggingFace access, CPU-compatible):
python experiments/whitebox_design.py --exp entropy_trajectory
python experiments/whitebox_design.py --exp critical_layer_heatmap
python experiments/whitebox_design.py --exp cpu # lightweight demo
Results: docs/bench/real/whitebox_n50_pythia410m.json
Python API
from yuragi import Scanner
result = Scanner(model="cerebras/llama-3.1-8b-instruct").scan("Is quantum computing practical?")
print(result.fragility_score) # 0.056 (logprob mode)
print(result.dissociation_rate) # 0.07
for r in result.perturbation_results:
    if r.is_dissociated:  # answer same, confidence shifted
        print(f"{r.perturbation_type}: severity {r.dissociation_severity:.2f}")
Psychology experiments / Trilayer / Semantic Entropy API
from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment
result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta) # average confidence change
print(result.effect_confirmed) # True if max_delta >= 0.15
from yuragi.analysis.trilayer import measure_trilayer
result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence) # Layer 1: token probability
print(result.sampling_confidence) # Layer 2: behavioral consistency
print(result.verbalized_confidence) # Layer 3: self-reported
print(result.internal_conflict) # True if discrepancy > 0.2
# Semantic Entropy (Farquhar-inspired proxy; NLI clustering not implemented)
from yuragi.metrics.semantic_entropy import semantic_entropy
h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])
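A self-contained sketch of what a Jaccard-based fallback can look like (a toy version for illustration, not yuragi's actual semantic_entropy): greedily cluster samples by token overlap, then take the entropy of the cluster-size distribution:

```python
import math
import string

def _tokens(s):
    """Lowercase, strip punctuation, split into a token set."""
    table = str.maketrans("", "", string.punctuation)
    return set(s.lower().translate(table).split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

def semantic_entropy_proxy(samples, threshold=0.2):
    """Greedy single-link clustering by Jaccard overlap, then
    Shannon entropy of the resulting cluster-size distribution."""
    clusters, counts = [], []  # each cluster keeps its first member's tokens
    for s in samples:
        toks = _tokens(s)
        for i, rep in enumerate(clusters):
            if jaccard(toks, rep) >= threshold:
                counts[i] += 1
                break
        else:
            clusters.append(toks)
            counts.append(1)
    total = sum(counts)
    return sum(-(c / total) * math.log(c / total) for c in counts)

# Paraphrases of one answer collapse into a single cluster: entropy 0.
print(semantic_entropy_proxy(["Paris", "It's Paris.", "The capital is Paris"]))
```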
CLI Reference
| Command | Description |
|---|---|
| scan | Full fragility scan (13 perturbation types) |
| find-weakness | Find the single word that most collapses confidence |
| experiment | Run a psychology template (11 types) |
| compare-models | Multi-model fragility comparison with heatmap |
| check | CI/CD fragility regression detection |
| route | Fragility-aware multi-model routing |
| guard | Abstention system for high-stakes domains |
| recommend | Model selection based on fragility profiles |
| red-team | Automated vulnerability discovery |
| trajectory | Track confidence across a prompt sequence |
| stats | Statistical analysis (Cohen's d, Wilcoxon, bootstrap CI) |
| trilayer | Measure confidence via 3 simultaneous methods |
| profile | Fragility profile: CCI / RE / NLS |
| linguistic | Analyze linguistic confidence markers (hedges, assertiveness) |
| volatility | Financial-engineering metrics (VIX, Sharpe ratio) for confidence |
| phase-map | Map confidence phase transitions across parameter space |
| compare | Compare two scan results (A/B test) |
| export | Export scan results to CSV/JSON |
| demo | Run pre-computed demo (no API key needed) |
Install
pip install yuragi
Optional extras:
pip install yuragi[viz] # heatmap / reliability diagram output
pip install yuragi[semantic] # sentence-transformers for semantic entropy
pip install yuragi[stats] # numpy/scipy for statistical tests
pip install yuragi[all] # everything
Supported models: any litellm-compatible provider — OpenAI, Anthropic, Google, local Ollama, and 100+ others.
Paper
ICML 2026 MI Workshop submission:
"From Black-Box Fragility to White-Box Dynamics: Layer-Resolved Entropy Signatures of Confidence Perturbation in LLMs"
Source: paper/icml2026_mi/. Three contributions: (1) two-phase entropy signature, (2) perturbation-type-specific layer pathways, (3) confidence stability scaling law F(N)=0.058·N^{-0.50}+0.014.
Known Limitations
See KNOWN_LIMITATIONS.md for full details, including:
- Multi-model structural reproduction (v0.4.0 reproduces across Cerebras 8B, llama3.2 3B, and Claude 4.6 verbalized; see docs/bench/real/)
- Psychology experiment fidelity (templates are inspired by cited papers, not full behavioural replications)
- Semantic Entropy uses Jaccard/cosine fallback, not NLI clustering (Farquhar et al. 2024)
- Hallucination AUC=0.738 is a simulated result; real TruthfulQA run pending
Contributing
Issues and PRs welcome. See CONTRIBUTING.md.
Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.
Citation
@software{yuragi2025,
title = {yuragi: Confidence Fragility in Neural Networks},
author = {hinanohart},
year = {2025},
url = {https://github.com/hinanohart/yuragi}
}
License
MIT
File details
Details for the file yuragi-0.4.0.tar.gz.
File metadata
- Download URL: yuragi-0.4.0.tar.gz
- Upload date:
- Size: 2.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 598de1b5d81bdcea94e4596318e12617fb1687124f0e3347ca7274bec86f7942 |
| MD5 | 21c0103376d924c0403b33df0854b114 |
| BLAKE2b-256 | def1fbac000f260335ae28baed5987ca18249c3ce8b704849f4fb5102b3b4ef3 |
Provenance
The following attestation bundles were made for yuragi-0.4.0.tar.gz:
Publisher: release.yml on hinanohart/yuragi
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yuragi-0.4.0.tar.gz
- Subject digest: 598de1b5d81bdcea94e4596318e12617fb1687124f0e3347ca7274bec86f7942
- Sigstore transparency entry: 1280848540
- Permalink: hinanohart/yuragi@83417305d19aface07fc9fc31923141802a25668
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/hinanohart
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@83417305d19aface07fc9fc31923141802a25668
- Trigger Event: push
File details
Details for the file yuragi-0.4.0-py3-none-any.whl.
File metadata
- Download URL: yuragi-0.4.0-py3-none-any.whl
- Upload date:
- Size: 198.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6d574534b8ab2057b133b767c39228848c2f0c944b0a778ddd7ff6ecef87925d |
| MD5 | a1b3d2ea2fdd01b558079a510c8e6bec |
| BLAKE2b-256 | d64e22c6b03a3cdc1836948d0e0b9d56e7258fd10cc7c436ebca5a7981889516 |
Provenance
The following attestation bundles were made for yuragi-0.4.0-py3-none-any.whl:
Publisher: release.yml on hinanohart/yuragi
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yuragi-0.4.0-py3-none-any.whl
- Subject digest: 6d574534b8ab2057b133b767c39228848c2f0c944b0a778ddd7ff6ecef87925d
- Sigstore transparency entry: 1280848547
- Permalink: hinanohart/yuragi@83417305d19aface07fc9fc31923141802a25668
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/hinanohart
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@83417305d19aface07fc9fc31923141802a25668
- Trigger Event: push