omniuq

State-of-the-art uncertainty quantification methods for large language models.

omniuq brings together rigorous, paper-faithful implementations of methods that measure when an LLM is unsure and why.


Install

pip install omniuq

For low-VRAM setups (e.g. Phi-4 14B on a 24 GB card), install the quantization extra:

pip install "omniuq[quantize]"

Spectral Uncertainty (Demo 1) calls OpenAI models as clarifier and judge, so it needs an OpenAI API key. Verbalized Confidence (Demo 2) does not.

export OPENAI_API_KEY=sk-...

The big picture: LLM Reliability

When we say "I want a reliable LLM," we usually mean three different things at once. Researchers split the field into three branches that build on each other.

LLM Reliability
│
├── 1. Uncertainty Quantification (UQ)
│   │
│   ├── What is uncertain?
│   │   ├── Input
│   │   ├── Reasoning
│   │   ├── Parameters / knowledge
│   │   └── Prediction / generated output
│   │
│   ├── Why is it uncertain?
│   │   ├── Aleatoric uncertainty
│   │   └── Epistemic uncertainty
│   │
│   └── How do we measure it?
│       ├── Single-generation methods
│       ├── Multi-generation methods
│       ├── External semantic-model methods
│       ├── Fine-tuning / trainable estimators
│       └── Conformal prediction methods
│
├── 2. Confidence Estimation
│   │
│   ├── Confidence in one answer
│   ├── Probability of generated sequence
│   ├── Verbalized confidence
│   ├── LLM-as-a-judge confidence
│   └── Trainable confidence estimators
│
└── 3. Evaluation
    │
    ├── Ranking quality
    │   ├── Can the score separate right from wrong?
    │   └── Metrics: AUROC, AUARC / AURC
    │
    └── Calibration quality
        ├── Are the confidence numbers truthful?
        └── Metric: ECE

Uncertainty Quantification asks: is the model unsure, and why? The "what" branch locates the source of uncertainty. The "why" branch separates ambiguity in the question (aleatoric) from gaps in the model's knowledge (epistemic). The "how" branch covers the algorithmic families used to estimate these signals.

Confidence Estimation turns uncertainty into a single trustworthiness number per answer — a score the application layer can act on (gate a response, escalate to a human, abstain).

Evaluation is how we judge the previous two. Ranking quality asks whether high-uncertainty answers really are the wrong ones (AUROC, AURC). Calibration quality asks whether the numbers themselves are honest — when the model says "90% sure," is it correct 90% of the time? (ECE).

A method can rank well but be miscalibrated, or be calibrated but rank poorly. Good UQ requires both.
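
To make the ranking/calibration distinction concrete, here is a minimal NumPy sketch of AUROC and ECE. It is not omniuq's evaluation code; the function names, the equal-width binning, and the toy scores are assumptions chosen for illustration.

```python
import numpy as np

def auroc(confidence, correct):
    """Chance that a random correct answer outscores a random wrong one."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = confidence[correct], confidence[~correct]
    # Compare every (correct, wrong) pair; a tie counts as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def ece(confidence, correct, n_bins=10):
    """Expected calibration error with equal-width, right-closed bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_index = np.digitize(confidence, edges[1:-1], right=True)
    total = 0.0
    for b in range(n_bins):
        mask = bin_index == b
        if mask.any():
            # |accuracy - mean confidence| in the bin, weighted by bin size.
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            total += mask.mean() * gap
    return total

labels = [1, 1, 1, 0, 0, 0]

# Ranks perfectly, but every score claims ~95% while accuracy is 50%.
ranked_conf = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94]
print("overconfident:", auroc(ranked_conf, labels), ece(ranked_conf, labels))

# Honest on average (50% confident, 50% correct), but cannot rank at all.
flat_conf = [0.5] * 6
print("uninformative:", auroc(flat_conf, labels), ece(flat_conf, labels))
```

The first score would be useful for deciding which answers to review first but misleading as a probability; the second is the reverse.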


Categories of UQ Methods

omniuq focuses on the UQ branch above. Within UQ, methods can be organized by the algorithmic family they belong to.

UQ Methods
│
├── 1. Input Uncertainty Methods
│   │
│   ├── 1.1 Prompt clarification
│   │   └── Resolves ambiguity by generating clarified versions of the input
│   │
│   ├── 1.2 Prompt perturbation
│   │   └── Measures stability under small input changes
│   │
│   └── 1.3 In-context sample variation
│       └── Measures sensitivity to examples or demonstrations
│
├── 2. Reasoning Uncertainty Methods
│   │
│   ├── 2.1 Chain-of-thought uncertainty
│   │   └── Measures disagreement across reasoning traces
│   │
│   ├── 2.2 Tree-of-thought uncertainty
│   │   └── Measures uncertainty across explored reasoning paths
│   │
│   ├── 2.3 Topology-based reasoning graphs
│   │   └── Measures structure and stability of reasoning graphs
│   │
│   └── 2.4 Uncertainty-guided reasoning repair
│       └── Uses uncertainty to revise weak or unstable reasoning
│
├── 3. Parameter Uncertainty Methods
│   │
│   ├── 3.1 Bayesian LoRA
│   │   └── Approximates posterior uncertainty over adapter weights
│   │
│   ├── 3.2 LoRA ensembles
│   │   └── Uses multiple fine-tuned adapters as an ensemble
│   │
│   ├── 3.3 Supervised uncertainty estimation
│   │   └── Trains a model to predict its own correctness or confidence
│   │
│   └── 3.4 Uncertainty-aware instruction tuning
│       └── Fine-tunes models to express calibrated uncertainty
│
└── 4. Prediction Uncertainty Methods
    │
    ├── 4.1 Single-Generation Methods
    │   │
    │   ├── 4.1.1 Perplexity
    │   │   └── Higher perplexity often indicates lower confidence
    │   │
    │   ├── 4.1.2 Log probability
    │   │   └── Uses token likelihood as a confidence signal
    │   │
    │   ├── 4.1.3 Entropy
    │   │   └── Measures uncertainty in the token distribution
    │   │
    │   ├── 4.1.4 phi_first
    │   │   └── Uses first-token confidence or entropy from a single decode
    │   │
    │   ├── 4.1.5 Response improbability
    │   │   └── Scores how unlikely the generated response is
    │   │
    │   └── 4.1.6 P(True)
    │       └── Uses the model's verbalized confidence that an answer is true
    │
    ├── 4.2 Multi-Generation Methods
    │   │
    │   ├── 4.2.1 Self-consistency
    │   │   └── Measures agreement across multiple sampled answers
    │   │
    │   ├── 4.2.2 Predictive entropy
    │   │   └── Measures entropy over generated answer distributions
    │   │
    │   ├── 4.2.3 Token-level entropy
    │   │   └── Aggregates uncertainty over generated tokens
    │   │
    │   └── 4.2.4 Conformal prediction
    │       └── Produces prediction sets with coverage guarantees
    │
    └── 4.3 Multi-Generation + External Model Methods
        │
        ├── 4.3.1 Semantic entropy
        │   └── Groups semantically equivalent answers before computing entropy
        │
        ├── 4.3.2 NLI clustering
        │   └── Uses an external NLI model to cluster equivalent answers
        │
        ├── 4.3.3 ICE answer grouping
        │   └── Groups answers conditioned on generated clarifications
        │
        ├── 4.3.4 Pairwise similarity graphs
        │   └── Builds graphs from answer-to-answer similarity
        │
        ├── 4.3.5 Graph degree
        │   └── Measures how centrally connected an answer is
        │
        ├── 4.3.6 Eccentricity
        │   └── Measures how far an answer is from other answers in the graph
        │
        └── 4.3.7 Eigenvalue-based metrics
            └── Computes uncertainty from eigenvalues of similarity matrices

The first tree is the map of the research landscape. The second tree is the map of the algorithms. They sit at different levels: the first tells you what question you're answering; the second tells you which tool to use.
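
As a rough illustration of branch 4, the sketch below computes a few of the listed signals from toy data using only the standard library. It is not omniuq's API: the function names are made up for this example, and `same_city` is a deliberately crude stand-in for the external NLI model that methods in 4.3 would use.

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """4.1.1 / 4.1.2: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def entropy(probs):
    """4.1.3: Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def self_consistency(answers):
    """4.2.1: majority answer and the fraction of samples that agree with it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

def group_answers(answers, same_meaning):
    """Greedy grouping: join the first group whose representative matches."""
    groups = []
    for answer in answers:
        for group in groups:
            if same_meaning(group[0], answer):
                group.append(answer)
                break
        else:
            groups.append([answer])
    return groups

def answer_entropy(answers, same_meaning=lambda a, b: a == b):
    """4.2.2 with exact match; 4.3.1-style when given a semantic test."""
    groups = group_answers(answers, same_meaning)
    return entropy([len(g) / len(answers) for g in groups])

def same_city(a, b):
    # Hypothetical equivalence check: compare the text before any comma.
    return a.split(",")[0].strip().lower() == b.split(",")[0].strip().lower()

# Single generation: four tokens, each assigned probability 0.5.
token_logprobs = [math.log(0.5)] * 4
ppl = perplexity(token_logprobs)

# Multiple generations of a short numeric answer.
majority, agreement = self_consistency(["42", "42", "41", "42", "40"])

# Grouping paraphrases before computing entropy lowers the estimate.
samples = ["Paris", "paris", "Paris, France", "Lyon"]
raw_entropy = answer_entropy(samples)
grouped_entropy = answer_entropy(samples, same_city)
print(ppl, majority, agreement, raw_entropy, grouped_entropy)
```

Exact-match entropy treats the three spellings of Paris as disagreement; the grouped version counts them as one answer, which is the motivation for the external-model methods in 4.3.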


Methods

The Category column refers to the numbered nodes in the UQ Methods tree above.

| Method | Category | Decomposes | Paper | Code | Reproduced | Status |
|---|---|---|---|---|---|---|
| Spectral Uncertainty (Walha et al., AAAI 2026) | 1.1 & 4.3.7 | AU + EU | arXiv | GitHub | TriviaQA: AUROC 89.66% vs. paper 91.92%; Colab | ✅ Available |
| Verbalized Confidence (Xiong et al., ICLR 2024) | 4.1.6 & 4.2.1 | | arXiv | GitHub | GSM8K: AUROC Vanilla 56.23% → CoT+M5+AvgConf 90.92% (+34.7 pts); Colab | ✅ Available |

Demo 1 — Spectral Uncertainty

Phi-4 14B as target, GPT-4o as clarifier, GPT-4.1 as judge — exactly the paper's setup.

import os
from omniuq import (
    SpectralUncertainty,
    load_llm_model,
    load_openai_client,
)

tokenizer, model = load_llm_model("microsoft/phi-4")

clarifier = load_openai_client(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4o",
)
judge = load_openai_client(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4.1",
)

uq = SpectralUncertainty(
    tokenizer, model,
    clarifier=clarifier,
    judge=judge,
)

print(uq.score("What is the capital of France?"))
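
For intuition about the eigenvalue-based family (category 4.3.7), the sketch below computes a von Neumann entropy from a pairwise similarity matrix over sampled answers. This is a generic illustration, not the internals of `SpectralUncertainty`; the function name and the toy matrices are assumptions, and the input is assumed to be a symmetric positive semi-definite similarity matrix.

```python
import numpy as np

def von_neumann_entropy(similarity):
    """Entropy (nats) of the spectrum of a trace-normalized similarity matrix."""
    k = np.asarray(similarity, dtype=float)
    k = (k + k.T) / 2.0                 # remove small asymmetries
    rho = k / np.trace(k)               # eigenvalues now sum to 1
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]  # treat 0 * log(0) as 0
    return float(-(eigvals * np.log(eigvals)).sum())

# Three sampled answers that all mean the same thing: one dominant eigenvalue.
all_agree = np.ones((3, 3))
low = von_neumann_entropy(all_agree)

# Three mutually unrelated answers: a flat spectrum.
all_differ = np.eye(3)
high = von_neumann_entropy(all_differ)

print(low, high)
```

Agreement concentrates the spectrum on one eigenvalue and drives the score toward zero; mutually unrelated answers spread it evenly and push the score toward log(n) for n samples.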

Demo 2 — Verbalized Confidence (Xiong et al.)

Llama-3.1-8B as target. No API keys needed — the model verbalizes its own confidence.

from omniuq import load_llm_model
from omniuq.verbalized_xiong import run_xiong

tokenizer, model = load_llm_model(
    "meta-llama/Llama-3.1-8B-Instruct"
)
device = model.device

# Strongest configuration: CoT + Self-Random M=5 + AvgConf
result = run_xiong(
    model, tokenizer,
    "If 7 cars need 14 hours to complete a task, "
    "how long do 5 cars need?",
    device,
    prompting="cot",
    n_samples=5,
    aggregation="avg_conf",
)
print(result["answer"], result["confidence"])
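
The aggregation step can be pictured as a confidence-weighted vote over the sampled answers. The sketch below is in the spirit of the `avg_conf` option named above, but whether `run_xiong` computes it exactly this way is an assumption; `aggregate_avg_conf` and the sample values are hypothetical.

```python
from collections import defaultdict

def aggregate_avg_conf(samples):
    """Pick the answer with the largest share of the total stated confidence.

    samples: list of (answer, confidence) pairs with confidence in [0, 1].
    Returns (answer, share of total confidence behind that answer).
    """
    total = sum(conf for _, conf in samples)
    if total == 0:
        raise ValueError("all confidences are zero; nothing to aggregate")
    weight = defaultdict(float)
    for answer, conf in samples:
        weight[answer] += conf
    best = max(weight, key=weight.get)
    return best, weight[best] / total

# Five hypothetical samples for the car question (7 * 14 / 5 = 19.6 hours).
samples = [("19.6", 0.9), ("19.6", 0.8), ("10", 0.6), ("19.6", 0.7), ("10", 0.5)]
answer, confidence = aggregate_avg_conf(samples)
print(answer, confidence)
```

Unlike a plain majority vote, an answer that appears less often can still win if the model states much higher confidence in it.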

License

MIT
