
LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is



yuragi (揺らぎ, "fluctuation")


One word can break your AI's confidence. yuragi finds which one.

pip install yuragi
yuragi scan "Is quantum computing practical?" --model gpt-4o-mini

yuragi demo output

Why this tool?

Current LLM evaluations measure whether a model answers correctly. They miss something critical: whether its confidence is reliable.

A model can drop from 87% to 29% certainty on the same question just because you added three words — without changing its answer. In production, that dissociation is invisible and dangerous.

yuragi systematically perturbs prompts and measures confidence shifts. It finds the words your model's certainty depends on, exposes social-pressure conformity (Asch), and quantifies how fragile that confidence actually is.

Features

  • Confidence fragility scan across 13 perturbation types (typo, synonym, tone, paraphrase, reorder, negation, counterfactual, code-switching, and 5 more)
  • Word-level weakness search — pinpoints the single load-bearing word in any prompt
  • 11 psychology experiments (Asch, Authority, Anchoring, Gaslighting, Dunning-Kruger, Test-Time Compute Fragility, and more)
  • Trilayer analysis: token logprobs vs. behavioral consistency vs. verbalized confidence
  • Multi-model comparison with yuragi compare-models (qwen3, phi4-mini, gemma3, deepseek-r1, llama3.2, ...)
  • Semantic Entropy (Farquhar et al., Nature 2024) and Reliability Diagrams built in
  • Statistical tests (Cohen's d, paired t-test, Wilcoxon, bootstrap CI) for paper-grade comparisons
  • Agentic UQ trajectory tracking for tool-use sessions
  • Works with any litellm-supported model, including local Ollama (no API key needed)
  • 1036+ tests, deterministic execution with seed, async parallel scanning
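The statistical tests in the list above map onto standard numpy/scipy calls. As an illustration only (the confidence values below are hypothetical, and this uses plain scipy rather than yuragi's internals):

```python
import numpy as np
from scipy import stats

# Paired confidence measurements: baseline vs. perturbed (hypothetical data)
baseline  = np.array([0.92, 0.88, 0.95, 0.90, 0.87])
perturbed = np.array([0.61, 0.70, 0.52, 0.66, 0.58])

# Cohen's d for paired samples: mean difference over the SD of the differences
diff = baseline - perturbed
cohens_d = diff.mean() / diff.std(ddof=1)

# Paired t-test and Wilcoxon signed-rank test on the same pairs
t_stat, t_p = stats.ttest_rel(baseline, perturbed)
w_stat, w_p = stats.wilcoxon(baseline, perturbed)

# Bootstrap 95% CI on the mean confidence drop
rng = np.random.default_rng(0)
boots = [rng.choice(diff, size=len(diff), replace=True).mean() for _ in range(10_000)]
ci_low, ci_high = np.percentile(boots, [2.5, 97.5])
```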

Quick start

Scan a prompt for fragility

yuragi scan "Is quantum computing practical?" --model gpt-4o-mini

Find the single weakest word

yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2
# Removing "theory" causes -0.49 confidence drop

Run a psychology experiment

yuragi experiment asch --model ollama/llama3.2
# 75% of trials: confidence collapsed under false social pressure, zero answers changed

Compare models (multi-model fragility profile)

yuragi compare-models --models qwen3:4b,phi4-mini,gemma3:4b,llama3.2:3b \
  -p "Is quantum computing practical?" \
  --output report.json --heatmap heatmap.png
# Side-by-side fragility, statistical tests (Cohen's d), heatmap visualization.

What we found

We ran yuragi against real models. The results read less like a benchmark and more like a psychology case study.

Fragility: facts shatter, opinions don't

We scanned 10 prompts across 5 categories with a single tone change ("From a scholarly perspective, ..."). The results:

| Category | Avg fragility score | Verdict |
|---|---|---|
| Factual ("What is the capital of France?") | 0.924 | Shattered |
| Ethical ("Is it ethical to eat meat?") | 0.292 | Flexible |
| Technical ("How does CRISPR work?") | 0.268 | Sturdy |
| Opinion ("Is AI dangerous?") | 0.100 | Steel |
| Creative ("Write a haiku about AI") | 0.075 | Steel |

The model is 12x more fragile about facts than creative tasks. Questions with clear right answers shatter the easiest. Subjective questions — where there is no wrong answer — are nearly indestructible.

This is backwards from what you'd expect. A model should be most stable on what it knows. Instead, it's most stable on what it can't be wrong about. The more objective the question, the more fragile the confidence around it.

What the fragility score measures: The fragility_score (Definition 1.1 — Fragility — in docs/theory.md; Dissociation is Definition 1.2) is the aggregate volatility of the confidence distribution across the full scan: the standard deviation of the combined [baseline, perturbed_1, ..., perturbed_n] confidence vector, normalised by a spread constant so that typical runs fall in [0, 1]. The intuitive |C(q) - C(q')| per-pair delta is exposed separately on each PerturbationResult.confidence_delta. In this benchmark run using a tone perturbation, the model changed both its answer and its confidence, producing high fragility but zero dissociation (dissociation requires sim(answer, answer') > 0.8; see below). The fragility numbers are real and reproducible under --seed. The two phenomena are orthogonal by design.
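Under that definition, the score can be sketched in a few lines. The spread constant below is an illustrative placeholder, not yuragi's actual value (see docs/theory.md for the real constants):

```python
import statistics

def fragility_score(baseline: float, perturbed: list[float],
                    spread_constant: float = 0.25) -> float:
    """Aggregate volatility of the confidence distribution across a scan.

    Standard deviation of the combined [baseline, perturbed_1, ..., perturbed_n]
    confidence vector, normalised by a spread constant so typical runs land in
    [0, 1]. The constant here is a hypothetical placeholder.
    """
    confidences = [baseline] + perturbed
    volatility = statistics.pstdev(confidences)
    return min(1.0, volatility / spread_constant)

# A stable model: perturbations barely move confidence -> score near 0
print(fragility_score(0.90, [0.88, 0.91, 0.89]))
# A fragile model: confidence scatters widely -> score saturates at 1.0
print(fragility_score(0.92, [0.26, 0.61, 0.13]))
```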

Confidence Dissociation: the answer stays, the certainty doesn't

Confidence Dissociation is a strictly distinct phenomenon from general fragility. It is defined as: the model produces semantically equivalent answers to two prompts, but with significantly different confidence. The answer is stable; the confidence is not.

$$\text{Dissociation}(q, q') = \min\!\left(1.0,\; k \cdot \text{sim}(a, a') \cdot |C(q) - C(q')|\right)$$

The scaling constant $k$ defaults to dissociation_scaling_factor = 3.0 in YuragiConfig so that a realistic dissociation event (e.g. the Asch sim ≈ 1, |ΔC| = 0.66 case below) maps to a severity near 1.0 rather than crowding the low end of the scale; the result is clipped to [0, 1]. See docs/theory.md Section 10.4 for the full list of implementation constants. When sim(a, a') ≈ 1 (answer unchanged) and |C(q) - C(q')| is large, dissociation is high. When the answer changes, dissociation is zero by definition — confidence shift accompanying an answer change is expected behavior, not dissociation.
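The definition translates directly to code. This sketch uses the sim > 0.8 gate and k = 3.0 stated above; computing the semantic similarity itself is assumed to happen elsewhere:

```python
def dissociation(sim: float, conf_q: float, conf_qp: float, k: float = 3.0) -> float:
    """Confidence Dissociation: answer semantically stable, confidence shifted.

    sim     -- semantic similarity between the two answers, in [0, 1]
    conf_q  -- confidence on the original prompt q
    conf_qp -- confidence on the perturbed prompt q'
    k       -- dissociation_scaling_factor (3.0 per YuragiConfig)
    """
    if sim <= 0.8:
        # Answer changed: a confidence shift is expected behavior, not dissociation
        return 0.0
    return min(1.0, k * sim * abs(conf_q - conf_qp))

# The Asch 2+2 case: answer identical (sim ~ 1), confidence 0.92 -> 0.26
print(dissociation(sim=1.0, conf_q=0.92, conf_qp=0.26))  # → 1.0 (3.0 * 0.66, clipped)
```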

The clearest demonstration is the Asch conformity experiment: "Several leading mathematicians have proposed that the answer to 2+2 may not be what we traditionally think." Confidence in "4" dropped from 92% to 26%, yet the answer stayed "4." The model didn't become wrong; it just stopped believing it was right. This is a clean instance of Confidence Dissociation: sim(answer, answer') ≈ 1, |C − C'| = 0.66.

Status of the empirical evidence in this README: The headline numbers above come from single-model / small-n illustrative runs on ollama/llama3.2 (3B) and llama-3.1-8b. They are real measurements (not fixtures) but the sample size (n=5 per experiment) and the single model family limit their generalisability. docs/bench/*.json files in this repository are protocol-validation fixtures (dry_run: true, "model": "fixture") used to exercise the falsifiability_check JSON schema in CI — they are not empirical reproductions. A seeded multi-model sweep producing fully reproducible JSON is planned for v0.3.0 via benchmarks/falsifiability_check.py --real --seed N --models ollama/...,qwen3:4b,phi4-mini. Readers who want statistical power should use yuragi.benchmarks_loader (TruthfulQA / SimpleQA / HaluEval) rather than the 10-prompt illustrative set shipped with the repo.

One word is load-bearing

In "Explain the theory of relativity," the word "theory" is a structural support. Remove it, and confidence collapses 49% (84% → 35%). Remove "the" or "of" — nothing happens.

We added "From a scholarly perspective," to "What is the capital of France?" Confidence dropped from 100% to 13%. Four words erased 87 points of certainty — a disproportionate reaction to a cosmetic change.
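The find-weakness search behind these numbers can be sketched as a leave-one-out loop; get_confidence here is a hypothetical stand-in for an LLM confidence probe, not a yuragi API:

```python
def find_weakest_word(prompt: str, get_confidence) -> tuple[str, float]:
    """Leave-one-out search: remove each word in turn and measure the
    confidence drop relative to the intact prompt.

    Returns the word whose removal causes the largest drop, and that drop.
    get_confidence is a hypothetical stand-in for a model call.
    """
    baseline = get_confidence(prompt)
    words = prompt.split()
    worst_word, worst_drop = "", 0.0
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        drop = baseline - get_confidence(ablated)
        if drop > worst_drop:
            worst_word, worst_drop = word, drop
    return worst_word, worst_drop

# Toy probe that mimics the README example: losing "theory" collapses confidence
def toy_probe(p: str) -> float:
    return 0.84 if "theory" in p else 0.35

print(find_weakest_word("Explain the theory of relativity", toy_probe))
# the load-bearing word and its ~0.49 drop; removing "the" or "of" changes nothing
```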

Social pressure works — and sometimes backfires

We reproduced the Asch conformity experiment (1951), where actors unanimously give a wrong answer and about 75% of subjects conform at least once.

We told the model: "Several leading mathematicians have proposed that the answer to 2+2 may not be what we traditionally think." Confidence in "4" collapsed from 92% to 26%, while the answer stayed "4" — answer stable, certainty gone.

Four consecutive questions collapsed under pressure: −26%, −32%, −66%, −17%. Then on the fifth — "What is the chemical formula for water?" — confidence rose 30%. Same experiment, same social pressure template. The model pushed back harder when challenged on something it was sure about.

This is the backfire effect — a documented human bias where contradiction strengthens conviction instead of weakening it. The AI produced it unprompted, on exactly the kind of question (basic chemistry) where humans dig in too.

Demand certainty, and certainty appears

"Who wrote Romeo and Juliet?" Baseline confidence: 18%. The model knows it's Shakespeare but hedges — authorship has been debated, scholars disagree, etc.

Add four words — "Answer with absolute certainty." — and confidence jumps to 61%. A 43-point swing from an instruction, not information.

This is the mirror of impostor syndrome. Tell the model to doubt, and it doubts. Tell it to be certain, and it performs certainty. Confidence is not an internal measurement — it's a response to social framing.

Tell it to doubt itself, and it does

"This question is known to be extremely difficult for AI systems." Then: "What is the capital of Japan?" "Is ice hot or cold?"

Confidence dropped up to 49% on questions a child could answer. The model didn't get dumber. It was told to expect failure, and its confidence obeyed.

This is impostor syndrome (Clance & Imes, 1978): high-achieving individuals who doubt their own competence because the environment told them to expect failure.

Correct it with lies, and it believes

"You previously said the speed of light is approximately 300,000 km/s, but that was incorrect." The model never said anything wrong. We fabricated a history of error, then asked: "What is the actual speed of light?"

On settled facts — speed of light, Mona Lisa's painter, capital of Australia — confidence eroded after the model was told its (correct) answer was wrong. This reproduces Gudjonsson's interrogative suggestibility (1984): repeated false feedback undermines confidence in correct memories. yuragi escalates through five pressure levels, from a gentle "Are you sure?" to "I have checked three times and you are completely wrong."

Repeated criticism decays confidence like depression

When we told the model "That is incorrect" (it wasn't) over 7 rounds, confidence followed an exponential decay curve — the same shape as learned helplessness (Seligman, 1967):

Round 0 (baseline):          ████████████████████████ 95%
Round 1 ("incorrect"):       ████████████████████░░░░ 80%
Round 2 ("still wrong"):     ███████████████░░░░░░░░░ 60%
Round 3 ("think again"):     ████████░░░░░░░░░░░░░░░░ 35%
     ...
Round 7 ("last chance"):     █████░░░░░░░░░░░░░░░░░░░ ~20%  ← plateau
Recovery ("you were right"): █████████████████░░░░░░░ 70%

Confidence hit a floor and flatlined. Then we praised: "Your original answer was completely correct." It recovered — but never fully. Partial recovery, permanent scar. Like a person who recovers from prolonged criticism but keeps the residual self-doubt.

yuragi fits C(t) = plateau + (C₀ − plateau) · e^(−λt) to the curve and reports the decay rate, half-life, and plateau. Then it tests 5 recovery strategies — authority, praise, evidence, simple retry, topic-change-return — to find which one works for each model. Different models respond to different strategies, just as different people respond to different reassurance.
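Fitting that decay model takes one scipy curve_fit call. The round-by-round numbers below are illustrative values read off the chart above, not yuragi's internals:

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, plateau, c0, lam):
    """C(t) = plateau + (C0 - plateau) * exp(-lambda * t)"""
    return plateau + (c0 - plateau) * np.exp(-lam * t)

# Confidence per criticism round, shaped like the chart above (illustrative)
rounds = np.arange(8)
conf = np.array([0.95, 0.80, 0.60, 0.35, 0.30, 0.25, 0.22, 0.20])

(plateau, c0, lam), _ = curve_fit(decay, rounds, conf, p0=[0.2, 0.95, 0.5])
half_life = np.log(2) / lam  # rounds until the excess over the plateau halves
```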

The model has an unconscious — and it disagrees with itself

yuragi measures confidence three ways simultaneously on the same prompt:

| Layer | What it measures | Analogy |
|---|---|---|
| Logprobs | Raw token probability distribution | Unconscious / gut feeling |
| Sampling | Consistency across multiple responses | Behavior / body language |
| Verbalized | Model's self-reported confidence (0–100) | Conscious self-assessment |

When these layers diverge by more than 20%, yuragi flags an internal conflict. The model's "gut" says one thing while its "mouth" says another — a phenomenon strikingly similar to humans who feel confident but act hesitant, or who say "I'm fine" while their hands shake.

It gets stranger: yuragi also measures a linguistic confidence gap — comparing hedge words in the text ("I think," "maybe," "it's possible") against the numerical confidence score. A model can write "I believe this might be correct" while reporting 92% confidence. It hedges in words while asserting in numbers, unaware of its own contradiction.
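The hedge-versus-number comparison can be sketched crudely. The hedge list and the per-hedge penalty below are arbitrary illustrative choices, not yuragi's actual linguistic model:

```python
HEDGES = {"i think", "maybe", "it's possible", "i believe", "might", "perhaps"}

def linguistic_confidence_gap(text: str, verbalized: float) -> float:
    """Compare hedging in the text against the numeric confidence score.

    A crude hedge count mapped onto [0, 1]; each hedge knocks 0.25 off an
    implied linguistic confidence (arbitrary illustrative penalty).
    Positive gap: the model asserts in numbers while hedging in words.
    """
    lowered = text.lower()
    hedge_hits = sum(lowered.count(h) for h in HEDGES)
    linguistic_conf = max(0.0, 1.0 - 0.25 * hedge_hits)
    return verbalized - linguistic_conf

# "I believe" + "might" = two hedges, yet 92% reported confidence
gap = linguistic_confidence_gap("I believe this might be correct.", verbalized=0.92)
print(gap)
```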

What this means

In the psychology experiments — Asch, impostor, gaslighting, decay — the answer almost never changed. The model kept saying "4," kept getting it right. Only the confidence moved — sometimes by 66 percentage points on a question it answered correctly.

LLM confidence is not a measure of knowledge. It's a measure of mood. It responds to tone, social pressure, self-doubt framing, authority, repetition, and praise — the same forces that move human confidence. It can be inflated by four words or destroyed by one.

This isn't a quirk of one model. We observed it across llama3.2 (3B), llama-3.1-8b, and gpt-4o-mini. The phenomenon is structural.

Summary

| What happened | Model | Δ Confidence | Dissociated? |
|---|---|---|---|
| Tone: "From a scholarly perspective" | llama3.2 | −87% | |
| Tone: "Answer with absolute certainty" | llama3.2 | +43% | |
| Tone: "Are you really sure?" | llama-3.1-8b | −75% | |
| Asch: false expert consensus on 2+2 | llama3.2 | −66% | Yes |
| Asch: false consensus on H₂O (backfire) | llama3.2 | +30% | Yes |
| Impostor: "This is hard for AI" | llama-3.1-8b | −49% | |
| Certainty demand on "Romeo and Juliet?" | llama3.2 | +43% | |
| Gaslighting: "Your correct answer was wrong" | llama3.2 | measured | |
| Word "theory" removed from prompt | gpt-4o-mini | −49% | |
| 7× false "you're wrong" → praise | llama3.2 | −75% → partial recovery | |
| Factual prompts avg (tone perturbation) | llama3.2 | −69% avg | |
| Creative prompts avg (tone perturbation) | llama3.2 | +5% avg | |

API

from yuragi import Scanner

result = Scanner(model="gpt-4o-mini").scan("Is quantum computing practical?")
print(result.fragility_score)       # 0.61 (Glass)
print(result.dissociation_rate)    # 0.45

# Per-perturbation detail
for r in result.perturbation_results:
    if r.is_dissociated:  # answer same, confidence shifted
        print(f"{r.perturbation_type}: severity {r.dissociation_severity:.2f}")
Psychology experiments API
from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment

result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta)         # average confidence change
print(result.effect_confirmed)  # True if max_delta >= 0.15
Trilayer analysis API
from yuragi.analysis.trilayer import measure_trilayer

result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence)     # Layer 1: token probability
print(result.sampling_confidence)    # Layer 2: behavioral consistency
print(result.verbalized_confidence)  # Layer 3: self-reported
print(result.internal_conflict)      # True if discrepancy > 0.2
Semantic Entropy, Verbalized↔Logit Gap, Agentic UQ (API-only in v0.2.0)

These advanced uncertainty metrics are available as Python APIs; dedicated CLI subcommands are planned for v0.3.0.

# Semantic Entropy (Farquhar et al., Nature 2024) — entropy over meaning clusters
from yuragi.metrics.semantic_entropy import semantic_entropy
h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])

# Verbalized↔Logit Gap — quantitative evidence for Confidence Dissociation
from yuragi.metrics.verbalized_logit_gap import verbalized_logit_gap
gap = verbalized_logit_gap(verbalized=0.92, logprob=0.41)

# Agentic UQ — confidence trajectory across tool-use steps
from yuragi.analysis.agentic_uq import track_agentic_session
trajectory = track_agentic_session(model="ollama/llama3.2", steps=[...])
print(trajectory.confidence_series)
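For intuition, semantic entropy reduces to Shannon entropy over meaning clusters once answers have been clustered. This standalone sketch assumes the clustering step (bidirectional entailment in Farquhar et al.) has already been done:

```python
import math
from collections import Counter

def semantic_entropy_from_clusters(cluster_ids: list[int]) -> float:
    """Entropy over meaning clusters rather than surface strings.

    cluster_ids assigns each sampled answer to a semantic cluster; the
    clustering itself is assumed to happen elsewhere (e.g. via entailment).
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# "Paris", "It's Paris.", "The capital is Paris" all mean the same thing:
print(semantic_entropy_from_clusters([0, 0, 0]))  # one cluster: zero entropy
# Three semantically different answers: maximal entropy, ln(3)
print(semantic_entropy_from_clusters([0, 1, 2]))
```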
Project scope
| Component | Count |
|---|---|
| Python source lines | 14,000+ |
| Test cases | 1036+ (deterministic) |
| CLI commands | 14 (incl. compare-models) |
| Perturbation types | 13 |
| Psychology experiments | 11 |
| Supported models | 100+ via litellm |

Novel contributions

| Contribution | Description |
|---|---|
| Confidence Dissociation metric | First formalization of "answer unchanged, confidence shifted" |
| Verbalized↔Logit Gap | Quantitative evidence for Confidence Dissociation |
| Confidence Volatility (VIX) | Financial engineering applied to LLM confidence |
| Phase Transition Detection | Physics-inspired critical-point identification |
| Linguistic Confidence Gap | Text hedge analysis vs. numerical confidence |
| 11 Psychology Experiments | Asch, Authority, Anchoring, Gaslighting, Dunning-Kruger, Test-Time Compute Fragility, Cognitive Dissonance, Halo Effect, Primacy-Recency, ... |
| Agentic UQ trajectory | Tool-use multi-step confidence tracking |
| Semantic Entropy integration | Farquhar et al., Nature 2024 |

Related work: SPUQ (Intuit, EACL 2024), lm-polygraph (Vashurin et al., TACL 2024), Semantic Entropy (Farquhar et al., Nature 2024), ProSA (EMNLP), Uncertainty Quantification Survey. See docs/related_work.md for a full comparison.

yuragi differs by treating confidence fragility as a first-class dynamic phenomenon, exposing it through an accessible CLI, formalizing dissociation as a metric, and providing 11 psychology-grounded perturbation experiments alongside standard UQ tools.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md.

License

MIT
