omniuq

State-of-the-art uncertainty quantification methods for large language models.

omniuq brings together rigorous, paper-faithful implementations of methods that measure when an LLM is unsure and why.

Install

pip install omniuq

For low-VRAM setups (e.g. Phi-4 14B on a 24 GB card), enable quantization:

pip install "omniuq[quantize]"

The Spectral Uncertainty demo calls OpenAI models as its clarifier and judge, so it needs an OpenAI API key:

export OPENAI_API_KEY=sk-...

The big picture: LLM Reliability

When we say "I want a reliable LLM," we usually mean three different things at once. Researchers split the field into three branches that build on each other.

LLM Reliability
│
├── 1. Uncertainty Quantification (UQ)
│   │
│   ├── What is uncertain?
│   │   ├── Input
│   │   ├── Reasoning
│   │   ├── Parameters / knowledge
│   │   └── Prediction / generated output
│   │
│   ├── Why is it uncertain?
│   │   ├── Aleatoric uncertainty
│   │   └── Epistemic uncertainty
│   │
│   └── How do we measure it?
│       ├── Single-generation methods
│       ├── Multi-generation methods
│       ├── External semantic-model methods
│       ├── Fine-tuning / trainable estimators
│       └── Conformal prediction methods
│
├── 2. Confidence Estimation
│   │
│   ├── Confidence in one answer
│   ├── Probability of generated sequence
│   ├── Verbalized confidence
│   ├── LLM-as-a-judge confidence
│   └── Trainable confidence estimators
│
└── 3. Evaluation
    │
    ├── Ranking quality
    │   ├── Can the score separate right from wrong?
    │   └── Metrics: AUROC, AUARC / AURC
    │
    └── Calibration quality
        ├── Are the confidence numbers truthful?
        └── Metric: ECE

Uncertainty Quantification asks: is the model unsure, and why? The "what" branch locates the source of uncertainty. The "why" branch separates ambiguity in the question (aleatoric) from gaps in the model's knowledge (epistemic). The "how" branch covers the algorithmic families used to estimate these signals.
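For readers who prefer a formula, one common textbook way to write the aleatoric/epistemic split is the entropy decomposition below. It is offered as a sketch under assumptions, not as the formulation used by any particular method in omniuq: here θ is assumed to be a latent variable the predictions are averaged over (for example model parameters in a Bayesian treatment), x is the input, and y is the generated answer.

```latex
% Total predictive uncertainty = expected entropy + mutual information.
% Requires amsmath (\text) and amssymb (\mathbb).
\[
\underbrace{H\!\left[\mathbb{E}_{\theta}\, p(y \mid x, \theta)\right]}_{\text{total}}
=
\underbrace{\mathbb{E}_{\theta}\, H\!\left[p(y \mid x, \theta)\right]}_{\text{aleatoric}}
+
\underbrace{I(y;\, \theta \mid x)}_{\text{epistemic}}
\]
```

The first term on the right is the spread that remains even when θ is fixed. The second term is the disagreement between different values of θ, which shrinks as the model's knowledge improves.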

Confidence Estimation turns uncertainty into a single trustworthiness number per answer — a score the application layer can act on (gate a response, escalate to a human, abstain).
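As an illustration of how an application layer might act on such a score, here is a minimal sketch of a three-way gate. The `gate` function, the `Decision` type, and both thresholds are hypothetical names and values chosen for this example; they are not part of omniuq.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str        # "answer", "escalate", or "abstain"
    confidence: float  # the score the decision was based on


def gate(confidence: float,
         answer_at: float = 0.8,
         escalate_at: float = 0.5) -> Decision:
    """Map a confidence score in [0, 1] to an action.

    The thresholds are illustrative assumptions; in practice they would be
    tuned on held-out data for the target error rate.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= answer_at:
        return Decision("answer", confidence)
    if confidence >= escalate_at:
        return Decision("escalate", confidence)
    return Decision("abstain", confidence)


for score in (0.93, 0.61, 0.12):
    print(score, gate(score).action)
```

A gate like this is only as good as the score behind it, which is why the Evaluation branch matters.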

Evaluation is how we judge the previous two. Ranking quality asks whether high-uncertainty answers really are the wrong ones (AUROC, AURC). Calibration quality asks whether the numbers themselves are honest — when the model says "90% sure," is it correct 90% of the time? (ECE).

A method can rank well but be miscalibrated, or be calibrated but rank poorly. Good UQ requires both.
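The difference between ranking and calibration can be made concrete with a small, dependency-free sketch. The `auroc` and `ece` helpers below are illustrative implementations written for this example (pairwise AUROC and equal-width-bin ECE), not functions exported by omniuq, and the toy data is made up.

```python
def auroc(confidences, correct):
    """Probability that a random correct answer receives a higher confidence
    than a random wrong one (ties count as one half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    total = len(confidences)
    error = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        error += len(b) / total * abs(avg_conf - accuracy)
    return error


correct = [True, True, True, False, False]

# Ranks perfectly (every correct answer scores higher) but is overconfident.
overconfident = [0.99, 0.98, 0.97, 0.96, 0.95]
# Matches the 60% accuracy exactly but cannot tell right from wrong.
uninformative = [0.6, 0.6, 0.6, 0.6, 0.6]

for name, scores in [("overconfident", overconfident),
                     ("uninformative", uninformative)]:
    print(name, round(auroc(scores, correct), 3), round(ece(scores, correct), 3))
```

The first score set has a perfect AUROC with a large ECE, and the second has a near-zero ECE with a chance-level AUROC, which is the trade-off described above.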

Categories of UQ Methods

omniuq focuses on the UQ branch above. Within UQ, methods can be organized by the algorithmic family they belong to.

UQ Methods
│
├── 1. Input Uncertainty Methods
│   │
│   ├── 1.1 Prompt clarification
│   │   └── Resolves ambiguity by generating clarified versions of the input
│   │
│   ├── 1.2 Prompt perturbation
│   │   └── Measures stability under small input changes
│   │
│   └── 1.3 In-context sample variation
│       └── Measures sensitivity to examples or demonstrations
│
├── 2. Reasoning Uncertainty Methods
│   │
│   ├── 2.1 Chain-of-thought uncertainty
│   │   └── Measures disagreement across reasoning traces
│   │
│   ├── 2.2 Tree-of-thought uncertainty
│   │   └── Measures uncertainty across explored reasoning paths
│   │
│   ├── 2.3 Topology-based reasoning graphs
│   │   └── Measures structure and stability of reasoning graphs
│   │
│   └── 2.4 Uncertainty-guided reasoning repair
│       └── Uses uncertainty to revise weak or unstable reasoning
│
├── 3. Parameter Uncertainty Methods
│   │
│   ├── 3.1 Bayesian LoRA
│   │   └── Approximates posterior uncertainty over adapter weights
│   │
│   ├── 3.2 LoRA ensembles
│   │   └── Uses multiple fine-tuned adapters as an ensemble
│   │
│   ├── 3.3 Supervised uncertainty estimation
│   │   └── Trains a model to predict its own correctness or confidence
│   │
│   └── 3.4 Uncertainty-aware instruction tuning
│       └── Fine-tunes models to express calibrated uncertainty
│
└── 4. Prediction Uncertainty Methods
    │
    ├── 4.1 Single-Generation Methods
    │   │
    │   ├── 4.1.1 Perplexity
    │   │   └── Higher perplexity often indicates lower confidence
    │   │
    │   ├── 4.1.2 Log probability
    │   │   └── Uses token likelihood as a confidence signal
    │   │
    │   ├── 4.1.3 Entropy
    │   │   └── Measures uncertainty in the token distribution
    │   │
    │   ├── 4.1.4 phi_first
    │   │   └── Uses first-token confidence or entropy from a single decode
    │   │
    │   ├── 4.1.5 Response improbability
    │   │   └── Scores how unlikely the generated response is
    │   │
    │   └── 4.1.6 P(True)
    │       └── Uses the model's verbalized confidence that an answer is true
    │
    ├── 4.2 Multi-Generation Methods
    │   │
    │   ├── 4.2.1 Self-consistency
    │   │   └── Measures agreement across multiple sampled answers
    │   │
    │   ├── 4.2.2 Predictive entropy
    │   │   └── Measures entropy over generated answer distributions
    │   │
    │   ├── 4.2.3 Token-level entropy
    │   │   └── Aggregates uncertainty over generated tokens
    │   │
    │   └── 4.2.4 Conformal prediction
    │       └── Produces prediction sets with coverage guarantees
    │
    └── 4.3 Multi-Generation + External Model Methods
        │
        ├── 4.3.1 Semantic entropy
        │   └── Groups semantically equivalent answers before computing entropy
        │
        ├── 4.3.2 NLI clustering
        │   └── Uses an external NLI model to cluster equivalent answers
        │
        ├── 4.3.3 ICE answer grouping
        │   └── Groups answers conditioned on generated clarifications
        │
        ├── 4.3.4 Pairwise similarity graphs
        │   └── Builds graphs from answer-to-answer similarity
        │
        ├── 4.3.5 Graph degree
        │   └── Measures how centrally connected an answer is
        │
        ├── 4.3.6 Eccentricity
        │   └── Measures how far an answer is from other answers in the graph
        │
        └── 4.3.7 Eigenvalue-based metrics
            └── Computes uncertainty from eigenvalues of similarity matrices

The first tree is the map of the research landscape. The second tree is the map of the algorithms. They sit at different levels: the first tells you what question you're answering; the second tells you which tool to use.
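To make a few leaf nodes of the second tree concrete, the sketch below computes perplexity (4.1.1), self-consistency (4.2.1), and a count-based variant of semantic entropy (4.3.1) in plain Python. All function names are hypothetical and written for this example. The `equivalent` predicate is a string-normalization stand-in for the external NLI model a real implementation would use, and the token log-probabilities and answers are made up.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def self_consistency(answers):
    """Return the most frequent answer and the share of samples that agree."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def semantic_entropy(answers, equivalent):
    """Entropy over clusters of equivalent answers, using cluster frequencies.

    Each answer joins the first cluster whose representative it is
    equivalent to; otherwise it starts a new cluster.
    """
    clusters = []
    for a in answers:
        for cluster in clusters:
            if equivalent(a, cluster[0]):
                cluster.append(a)
                break
        else:
            clusters.append([a])
    return entropy([len(c) / len(answers) for c in clusters])


def normalize(text):
    return text.strip().rstrip(".").lower()


def equivalent(a, b):
    # Stand-in for an NLI-based equivalence check (assumption for this sketch).
    return normalize(a) == normalize(b)


samples = ["Paris", "paris", "Paris.", "Lyon"]
raw_entropy = entropy([n / len(samples) for n in Counter(samples).values()])

print(perplexity([math.log(0.5)] * 4))
print(self_consistency([normalize(s) for s in samples]))
print(round(raw_entropy, 3), round(semantic_entropy(samples, equivalent), 3))
```

Grouping the three spellings of "Paris" into one cluster lowers the entropy compared with treating every string as a distinct answer, which is the motivation for the external-model methods in branch 4.3.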

Methods

The Category column refers to the numbered nodes in the UQ Methods tree above.

Method	Category	Decomposes	Paper	Code	Reproduced	Status
Spectral Uncertainty (Walha et al., AAAI 2026)	1.1 & 4.3.7	AU + EU	arXiv	GitHub	TriviaQA: AUROC 89.66% vs. paper 91.92% — Colab	✅ Available
Verbalized Confidence (Xiong et al., ICLR 2024)	4.1.6 & 4.2.1	—	arXiv	GitHub	GSM8K: AUROC Vanilla 56.23% → CoT+M5+AvgConf 90.92% (+34.7 pts) — Colab	✅ Available

Demo 1 — Spectral Uncertainty

Phi-4 14B as target, GPT-4o as clarifier, GPT-4.1 as judge — exactly the paper's setup.

import os
from omniuq import (
    SpectralUncertainty,
    load_llm_model,
    load_openai_client,
)

tokenizer, model = load_llm_model("microsoft/phi-4")

clarifier = load_openai_client(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4o",
)
judge = load_openai_client(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4.1",
)

uq = SpectralUncertainty(
    tokenizer, model,
    clarifier=clarifier,
    judge=judge,
)

print(uq.score("What is the capital of France?"))

Demo 2 — Verbalized Confidence (Xiong et al.)

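Llama-3.1-8B as target. No OpenAI API key is needed, because the model verbalizes its own confidence. The checkpoint is gated on Hugging Face, so your account must have been granted access before it can be downloaded.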

from omniuq import load_llm_model
from omniuq.verbalized_xiong import run_xiong

tokenizer, model = load_llm_model(
    "meta-llama/Llama-3.1-8B-Instruct"
)
device = model.device

# Strongest configuration: CoT + Self-Random M=5 + AvgConf
result = run_xiong(
    model, tokenizer,
    "If 7 cars need 14 hours to complete a task, "
    "how long do 5 cars need?",
    device,
    prompting="cot",
    n_samples=5,
    aggregation="avg_conf",
)
print(result["answer"], result["confidence"])
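For intuition about what an average-confidence aggregation does with several sampled answers, here is a minimal sketch based on our reading of the Avg-Conf rule in Xiong et al.: each candidate answer is scored by the confidence mass of the samples that produced it, divided by the total confidence mass. The `avg_conf` function and the sample values are hypothetical, and the implementation inside omniuq may differ.

```python
from collections import defaultdict


def avg_conf(samples):
    """Aggregate (answer, verbalized confidence) pairs.

    Returns the answer with the largest share of confidence mass together
    with that share, which serves as the aggregated confidence.
    """
    if not samples:
        raise ValueError("need at least one sample")
    total = sum(conf for _, conf in samples)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    mass = defaultdict(float)
    for answer, conf in samples:
        mass[answer] += conf
    best = max(mass, key=mass.get)
    return best, mass[best] / total


# Five made-up samples: three agree on "A", two on "B".
samples = [("A", 0.9), ("A", 0.8), ("B", 0.6), ("A", 0.7), ("B", 0.5)]
answer, confidence = avg_conf(samples)
print(answer, round(confidence, 3))
```

Compared with plain majority voting, this weighting lets a few highly confident samples outweigh a larger number of hesitant ones.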

License

MIT

Release history

0.3.0 (this version): May 10, 2026
0.2.0: May 8, 2026
0.1.0: May 7, 2026

Download files

Source Distribution: omniuq-0.3.0.tar.gz (24.6 kB), uploaded May 10, 2026

Built Distribution: omniuq-0.3.0-py3-none-any.whl (22.7 kB), uploaded May 10, 2026, Python 3

File details

Details for the file omniuq-0.3.0.tar.gz.

File metadata

Download URL: omniuq-0.3.0.tar.gz
Upload date: May 10, 2026
Size: 24.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for omniuq-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`d1635529f60fa7f75eda592959bf90fd899e804487ff2ca486ba6cc017693db5`
MD5	`17b3c2c86b4c69495a7ffa6f63312481`
BLAKE2b-256	`aecaefc43dcb89fc47485a159dce81480f6542524d7aaf5591589f51c3ddf69d`

File details

Details for the file omniuq-0.3.0-py3-none-any.whl.

File metadata

Download URL: omniuq-0.3.0-py3-none-any.whl
Upload date: May 10, 2026
Size: 22.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for omniuq-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c77af5d27ae0d806085dc46675bdefba9551387215b0d34be496d5a770a9c2b4`
MD5	`2f9bfe20f379e96bcf2f3b0fb844ec0a`
BLAKE2b-256	`7f6713e29c52bdb8ae14ff96869f8b978bdc729344129653993994a1bcc6fa50`
