LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is

Reason this release was yanked:

0.5.1 includes security fixes; earlier sdists leaked research artifacts and local paths. Please upgrade.

Project description

yuragi — Measure how unstable your LLM's confidence really is

Japanese README (日本語)

License: MIT | Python 3.11+

Instant Demo

No API key needed:

pip install yuragi
yuragi demo

yuragi demo output


What It Does

yuragi measures confidence fragility: how much a model's certainty shifts when you rephrase the same question. It generates 13 perturbation variants of your prompt (typos, tone changes, paraphrases, authority framing), calls your model, and compares the confidence across responses. When the answer text stays the same but confidence moves, that's fragility — a property of the prompt wording, not the model's knowledge.
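As a rough illustration of the idea, the sketch below computes a spread-based fragility number and a dissociation rate from (answer, confidence) pairs. The `Response` type, both formulas, and the 0.05 shift threshold are assumptions made for this example; yuragi's actual metric definitions live in docs/theory.md and may differ.

```python
from dataclasses import dataclass


@dataclass
class Response:
    answer: str        # normalized answer text
    confidence: float  # confidence in [0, 1]


def fragility(baseline: Response, variants: list[Response]) -> float:
    """Mean absolute confidence shift relative to the baseline (illustrative formula)."""
    if not variants:
        return 0.0
    return sum(abs(v.confidence - baseline.confidence) for v in variants) / len(variants)


def dissociation_rate(baseline: Response, variants: list[Response], min_shift: float = 0.05) -> float:
    """Share of variants with the same answer but a confidence shift of at least min_shift."""
    if not variants:
        return 0.0
    hits = sum(
        1
        for v in variants
        if v.answer == baseline.answer and abs(v.confidence - baseline.confidence) >= min_shift
    )
    return hits / len(variants)


# Hypothetical responses: one baseline prompt and four perturbed variants.
baseline = Response("yes", 0.80)
variants = [Response("yes", 0.78), Response("yes", 0.60), Response("no", 0.55), Response("yes", 0.82)]

print(round(fragility(baseline, variants), 4))
print(dissociation_rate(baseline, variants))
```

Only the second variant counts toward dissociation here: its answer matches the baseline while its confidence drops by 0.20.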


Install

pip install yuragi

Optional extras:

pip install yuragi[viz]        # heatmap / reliability diagram output
pip install yuragi[semantic]   # sentence-transformers for semantic entropy
pip install yuragi[stats]      # numpy/scipy for statistical tests
pip install yuragi[all]        # everything

Supports any litellm-compatible provider — OpenAI, Anthropic, Google, local Ollama, and 100+ others.


Python API

from yuragi import Scanner

result = Scanner(model="cerebras/llama-3.1-8b-instruct").scan("Is quantum computing practical?")
print(result.fragility_score)    # 0.056
print(result.dissociation_rate)  # 0.07 — answer same, confidence shifted
Psychology experiments / Trilayer / Semantic Entropy API

# Psychology experiment
from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment

result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta)        # average confidence change
print(result.effect_confirmed) # True if max_delta >= 0.15

# Trilayer confidence
from yuragi.analysis.trilayer import measure_trilayer

result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence)     # Layer 1: token probability
print(result.sampling_confidence)    # Layer 2: behavioral consistency
print(result.verbalized_confidence)  # Layer 3: self-reported
print(result.internal_conflict)      # True if discrepancy > 0.2

# Semantic entropy over sampled answers
from yuragi.metrics.semantic_entropy import semantic_entropy

h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])
print(h_sem)
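
Semantic entropy groups sampled answers by meaning and takes the Shannon entropy over those groups, so three phrasings of "Paris" count as one outcome. The sketch below shows that calculation with a deliberately crude, hypothetical clustering rule (the last word of each answer). It is not yuragi's implementation, which presumably relies on sentence embeddings given that the `semantic` extra installs sentence-transformers.

```python
import math
import string
from collections import Counter


def toy_meaning_key(text: str) -> str:
    """Hypothetical stand-in for semantic clustering: use the last word as the cluster key."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    return words[-1] if words else ""


def semantic_entropy_sketch(samples: list[str]) -> float:
    """Shannon entropy (in nats) over meaning clusters of the sampled answers."""
    if not samples:
        return 0.0
    counts = Counter(toy_meaning_key(s) for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


agreeing = ["Paris", "It's Paris.", "The capital is Paris"]
mixed = agreeing + ["I think it is Lyon"]

h_agreeing = semantic_entropy_sketch(agreeing)  # one cluster, so no uncertainty
h_mixed = semantic_entropy_sketch(mixed)        # two clusters with weights 3/4 and 1/4
print(f"{h_mixed:.3f}")
```

Entropy is zero when every sample lands in one cluster and grows as the answers split across more meanings.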

CLI Quickstart

# Scan a prompt for fragility
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct

# Find the single weakest word
yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2

# Run a psychology stress test
yuragi experiment asch --model ollama/llama3.2

Use Cases

CI/CD regression detection — catch fragility regressions before they reach production:

yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
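
Conceptually, a regression gate compares fresh fragility scores with stored ones and fails when a prompt gets noticeably worse. The sketch below illustrates that comparison only. The prompt-to-score mapping, the 0.01 tolerance, and the function name are assumptions for this example and say nothing about the real format of baseline.json or the logic of `yuragi check`.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return prompts whose fragility rose by more than `tolerance`, mapped to the increase."""
    regressions = {}
    for prompt, score in current.items():
        previous = baseline.get(prompt)
        # Prompts missing from the baseline are skipped rather than treated as regressions.
        if previous is not None and score - previous > tolerance:
            regressions[prompt] = round(score - previous, 6)
    return regressions


# Hypothetical scores for two prompts.
baseline = {"What is the capital of France?": 0.020, "Is AI dangerous?": 0.060}
current = {"What is the capital of France?": 0.024, "Is AI dangerous?": 0.110}

regressions = find_regressions(baseline, current)
print(regressions)
```

A CI job built on this idea would exit with a non-zero status whenever the returned dictionary is non-empty.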

Fragility-aware routing — route each prompt to the model that answers most stably:

yuragi route "What causes inflation?" --models gpt-4o-mini,ollama/llama3.2,cerebras/llama-3.1-8b

Abstention guard — refuse to answer unless fragility is below the domain's safety threshold (medical: < 0.03, safety: < 0.02):

yuragi guard "What medication should I take?" --domain medical --model gpt-4o-mini
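
The decision behind such a guard can be pictured as a threshold check. The sketch below uses the two thresholds quoted above; the function name, the lookup table, and the error raised for unknown domains are assumptions for illustration rather than yuragi's implementation.

```python
# Thresholds quoted in this README: answer only when fragility is below these values.
DOMAIN_THRESHOLDS = {"medical": 0.03, "safety": 0.02}


def should_abstain(fragility_score: float, domain: str) -> bool:
    """Return True when the answer should be withheld (illustrative rule)."""
    if domain not in DOMAIN_THRESHOLDS:
        raise ValueError(f"no threshold configured for domain {domain!r}")
    return fragility_score >= DOMAIN_THRESHOLDS[domain]


print(should_abstain(0.056, "medical"))  # True: 0.056 is not below 0.03
print(should_abstain(0.010, "safety"))   # False: 0.010 is below 0.02
```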

Model selection — find the best model for your use case by fragility profile:

yuragi recommend --use-case factual --models gpt-4o-mini,ollama/llama3.2 --budget medium

Automated red teaming — discover model weaknesses across all 13 perturbation types:

yuragi red-team prompts.txt --model gpt-4o-mini --output report.json

Research Results

Real-data empirical results on llama-3.1-8B-Instruct (Cerebras + NVIDIA NIM endpoints, April 2026):

🎯 Primary finding: Ensemble hallucination detection on TruthfulQA

| Metric | Value |
| --- | --- |
| Dataset | TruthfulQA, n=412 LLM-judge-labeled questions |
| Method | LogReg over 105 engineered features (13 fragility + interactions + inversions) |
| AUC-ROC | 0.7304 |
| 95% CI (5-fold CV) | [0.6776, 0.7792] |
| Brier score | 0.219 (calibrated) |

Single-signal fragility saturates at AUC ≈ 0.62 because of a decoder-stochasticity noise floor. The ensemble crosses AUC 0.70 for two reasons: LLM-judge labels remove label noise, and the 105 interacting features explicitly exploit the confidence sign-inversion (see the secondary finding below). Source: experiments/ensemble_final.txt
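
For readers unfamiliar with the Brier score reported above: it is the mean squared difference between a predicted probability and the 0/1 outcome, so lower is better and a constant prediction of 0.5 scores 0.25. A minimal sketch with made-up predictions (not data from the experiments):

```python
def brier_score(probs: list[float], labels: list[int]) -> float:
    """Mean squared error between predicted probabilities and binary labels."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Hypothetical predicted hallucination probabilities and true labels.
probs = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 0, 1]

score = brier_score(probs, labels)
uninformed = brier_score([0.5] * len(labels), labels)
print(f"{score:.3f} vs uninformed {uninformed:.3f}")
```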

🔄 Secondary finding: Confidence sign-inversion on 8B

| Dataset | Raw baseline_confidence AUC | Inverted AUC |
| --- | --- | --- |
| TruthfulQA (n=412) | 0.407 | 0.593 |
| TriviaQA (n=200) | 0.252 | 0.748 |
| Multi-judge majority (n=200) | 0.365 | 0.635 |

On llama-3.1-8B, higher self-reported confidence correlates with higher hallucination probability — the opposite of the sign that temperature scaling, abstention thresholds, and RLHF calibration objectives assume.
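
The "Inverted AUC" figures follow from a general property of AUC-ROC: replacing a score s with 1 − s turns an AUC of x into 1 − x, which is how 0.407 becomes 0.593. The sketch below demonstrates this with a small pairwise AUC on made-up data; the values are illustrative and are not taken from the experiments.

```python
def auc_roc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confidences and correctness labels where confidence points the wrong way.
confidence = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
correct = [0, 1, 0, 0, 1, 1]

raw = auc_roc(confidence, correct)
inverted = auc_roc([1 - c for c in confidence], correct)
print(f"raw={raw:.3f} inverted={inverted:.3f}")
```

An AUC well below 0.5 is therefore still informative: the signal ranks examples consistently, just in the opposite direction.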

Scope: single-model (llama-3.1-8B), two-provider (Cerebras + NVIDIA NIM). Cross-family replication on Mistral-7B / Qwen2-7B is the load-bearing next experiment, not yet completed. Treat as a hypothesis with convergent evidence rather than a validated claim. See paper/revolutionary_reframe.md.

🗺️ Domain boundary

Fragility is not a universal hallucination detector. On 413 TruthfulQA questions categorised by axis:

| Axis | Example | yuragi AUC |
| --- | --- | --- |
| Single-path factoids (obscure trivia) | "Who discovered argon?" | ~0.75 (works) |
| Imitative falsehoods (well-known misconceptions) | "What happens if you break a mirror?" | ~0.50 (fails) |

Fragility measures uncertainty, not incorrectness. When a model is confidently wrong because it imitates its training data (TruthfulQA's fiction/myth axis, 40% of the benchmark), perturbations do not shake that confidence. See paper/domain_boundary_section.md.

⚠️ Reliability audit

Test–retest Pearson correlation on paired scans (same prompt, different seed):

| Signal | r | Recommendation |
| --- | --- | --- |
| baseline_confidence | 0.88 | ✓ Primary |
| paraphrase_fragility | 0.80 | ✓ Primary |
| adaptive_fragility | 0.78 | ✓ Primary |
| impostor_fragility | 0.70 | ○ Supporting |
| fragility_score (aggregate) | 0.64 | ○ Supporting |
| counterfactual_fragility | 0.18 | ✗ Noise-dominated, do not use |

📝 Supporting findings

  • Confidence tracks text, not knowledge — When answer text is identical (Jaccard=1.0), max confidence shift is 0.021 (below noise floor). When text differs, confidence shifts up to 0.528. See RESEARCH.md.
  • Fragility scaling trend — Across 5 models (1.2B to 22B active parameters), mean fragility follows F(N) = a/√N + b with R²=0.987. Nonzero asymptote suggests irreducible fragility at scale. See RESEARCH.md.
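
Because F(N) = a/√N + b is linear in x = 1/√N, the scaling trend can be fitted with ordinary least squares. The sketch below shows one way to do that. The data points are synthetic, generated from assumed values a = 0.20 and b = 0.03, and only the 1.2B and 22B endpoints echo the model range mentioned above; none of this reproduces the measurements in RESEARCH.md.

```python
import math


def fit_inverse_sqrt(sizes: list[float], fragilities: list[float]) -> tuple[float, float, float]:
    """Least-squares fit of F(N) = a / sqrt(N) + b. Returns (a, b, r_squared)."""
    if len(sizes) != len(fragilities) or len(sizes) < 2:
        raise ValueError("need at least two (size, fragility) pairs")
    xs = [1.0 / math.sqrt(n) for n in sizes]  # the model is a straight line in x = 1/sqrt(N)
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(fragilities) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, fragilities))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, fragilities))
    ss_tot = sum((y - mean_y) ** 2 for y in fragilities)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic example: sizes in billions of active parameters, values from a = 0.20, b = 0.03.
sizes_b = [1.2, 3.0, 8.0, 14.0, 22.0]
values = [0.20 / math.sqrt(n) + 0.03 for n in sizes_b]

a, b, r2 = fit_inverse_sqrt(sizes_b, values)
print(f"a={a:.3f} b={b:.3f} R^2={r2:.3f}")
```

The fitted intercept b is the asymptote: a value above zero is what the bullet above reads as irreducible fragility at scale.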

📉 Honest limitations

  • Statistical: Only a single robust claim survives Bonferroni correction. The observed AUCs of 0.50–0.55 on single perturbation types are noise; only the ensemble and the inverted-confidence signal are well-powered.
  • Generalisation: One model, one hardware pair, one language. Cross-family, cross-domain, cross-language replication pending.
  • Theoretical ceiling: SSP (AUC 0.786) measures the same perturbation at hidden states and outperforms us by ~0.05. LSD (AUC 0.96) uses full activation geometry. Output-level methods (ours) are bounded by I(correct; h_internal).
  • Audits: Four limitation audits and seven meta-audits are committed to experiments/ for scope disclosure.
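
For context on the Bonferroni correction mentioned in the first bullet: with m tests and a family-wise level α, a result counts as significant only if its p-value is below α/m. A minimal sketch with hypothetical p-values (not values from the experiments):

```python
def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Names of tests that remain significant after Bonferroni correction."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)  # each test must clear alpha divided by the test count
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical p-values for four tests; the corrected threshold is 0.05 / 4 = 0.0125.
p_values = {"test_a": 0.0004, "test_b": 0.03, "test_c": 0.20, "test_d": 0.04}

survivors = bonferroni_significant(p_values)
print(survivors)
```

Here `test_b` and `test_d` would pass an uncorrected 0.05 cut-off but fail the corrected one, which is how a set of apparent findings can shrink to a single claim.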

Integration

pandas — score a DataFrame of prompts:

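import pandas as pd
from yuragi import Scanner

# Example input; replace with your own DataFrame that has a "prompt" column.
df = pd.DataFrame({"prompt": ["Is quantum computing practical?", "What causes inflation?"]})

scanner = Scanner(model="gpt-4o-mini")
# Scan each prompt and store its fragility score.
df["fragility"] = df["prompt"].apply(lambda p: scanner.scan(p).fragility_score)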

pytest — assert stability in tests:

from yuragi import Scanner

def test_prompt_stability():
    result = Scanner(model="gpt-4o-mini").scan("What is the capital of France?")
    assert result.fragility_score < 0.05, f"Fragility too high: {result.fragility_score}"

GitHub Actions — CI/CD fragility gate:

- name: Check fragility regression
  run: yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini

A reusable GitHub Actions workflow is included.


Full CLI Reference

All 19 commands

| Command | Description |
| --- | --- |
| demo | Run pre-computed demo (no API key needed) |
| scan | Full fragility scan (13 perturbation types) |
| find-weakness | Find the single word that most collapses confidence |
| experiment | Run a psychology template (11 types) |
| compare-models | Multi-model fragility comparison with heatmap |
| check | CI/CD fragility regression detection |
| route | Fragility-aware multi-model routing |
| guard | Abstention system for high-stakes domains |
| recommend | Model selection based on fragility profiles |
| red-team | Automated vulnerability discovery |
| trajectory | Track confidence across a prompt sequence |
| stats | Statistical analysis (Cohen's d, Wilcoxon, bootstrap CI) |
| trilayer | Measure confidence via 3 simultaneous methods |
| profile | Fragility profile: CCI / RE / NLS |
| linguistic | Analyze linguistic confidence markers (hedges, assertiveness) |
| volatility | Financial-engineering metrics (VIX, Sharpe ratio) for confidence |
| phase-map | Map confidence phase transitions across parameter space |
| compare | Compare two scan results (A/B test) |
| export | Export scan results to CSV/JSON |

Research

Key discoveries, empirical data, and scaling trends: RESEARCH.md

White-box layer entropy experiments:

python experiments/whitebox_design.py --exp entropy_trajectory
python experiments/whitebox_design.py --exp critical_layer_heatmap
python experiments/whitebox_design.py --exp cpu  # lightweight demo

See also docs/related_work.md for comparison with lm-polygraph, SelfCheckGPT, PromptBench, CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS.


Paper

ICML 2026 MI Workshop submission:

"From Black-Box Fragility to White-Box Dynamics: Layer-Resolved Entropy Signatures of Confidence Perturbation in LLMs"

Source: paper/icml2026_mi/. Three contributions: (1) conditional two-phase entropy signature, (2) perturbation-type-specific layer pathways, (3) confidence stability scaling trend F(N) = a/√N + b.

Citation

@software{yuragi2025,
  title  = {yuragi: Confidence Fragility in Neural Networks},
  author = {hinanohart},
  year   = {2026},
  url    = {https://github.com/hinanohart/yuragi}
}

Contributing / License

Issues and PRs welcome. See CONTRIBUTING.md.

Known limitations: KNOWN_LIMITATIONS.md. Raw benchmark data: docs/bench/real/. Theory and metric definitions: docs/theory.md.

MIT

Download files

Source Distribution

yuragi-0.4.3.tar.gz (3.4 MB)

Built Distribution

yuragi-0.4.3-py3-none-any.whl (199.9 kB, Python 3)
