UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

These details have not been verified by PyPI

Project links

Project description

uqlm: Uncertainty Quantification for Language Models

UQLM is a Python library for Large Language Model (LLM) hallucination detection using state-of-the-art uncertainty quantification techniques.

Installation

The latest version can be installed from PyPI:

pip install uqlm

Hallucination Detection

UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into four main types:

Scorer Type	Added Latency	Added Cost	Compatibility	Off-the-Shelf / Effort
Black-Box Scorers	⏱️ Medium-High (multiple generations & comparisons)	💸 High (multiple LLM calls)	🌍 Universal (works with any LLM)	✅ Off-the-shelf
White-Box Scorers	⚡ Minimal* (token probabilities already returned)	✔️ None* (no extra LLM calls)	🔒 Limited (requires access to token probabilities)	✅ Off-the-shelf
LLM-as-a-Judge Scorers	⏳ Low-Medium (additional judge calls add latency)	💵 Low-High (depends on number of judges)	🌍 Universal (any LLM can serve as judge)	✅ Off-the-shelf
Ensemble Scorers	🔀 Flexible (combines various scorers)	🔀 Flexible (combines various scorers)	🔀 Flexible (combines various scorers)	✅ Off-the-shelf (beginner-friendly); 🛠️ Can be tuned (best for advanced users)

^{^{*Does not apply to multi-generation white-box scorers, which have higher cost and latency.}}

Below we provide illustrative code snippets and details about available scorers for each type.

Black-Box Scorers (Consistency-Based)

These scorers assess uncertainty by measuring the consistency of multiple responses generated from the same prompt. They are compatible with any LLM, intuitive to use, and don't require access to internal model states or token probabilities.

Example Usage: Below is a sample of code illustrating how to use the BlackBoxUQ class to conduct hallucination detection.

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")

from uqlm import BlackBoxUQ
bbuq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"], use_best=True)

results = await bbuq.generate_and_score(prompts=prompts, num_responses=5)
results.to_df()

Above, use_best=True implements mitigation so that the uncertainty-minimized responses is selected. Note that although we use ChatOpenAI in this example, any LangChain Chat Model may be used. For a more detailed demo, refer to our Black-Box UQ Demo.

Available Scorers:

Discrete Semantic Entropy (Farquhar et al., 2024; Bouchard & Chauhan, 2025)
Number of Semantic Sets (Lin et al., 2024; Vashurin et al., 2025; Kuhn et al., 2023)
Non-Contradiction Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
Entailment Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
Exact Match (Cole et al., 2023; Chen & Mueller, 2023)
BERTScore (Manakul et al., 2023; Zheng et al., 2020)
Cosine Similarity (Shorinwa et al., 2024; HuggingFace)

White-Box Scorers (Token-Probability-Based)

These scorers leverage token probabilities to estimate uncertainty. They offer single-generation scoring, which is significantly faster and cheaper than black-box methods, but require access to the LLM's internal probabilities, meaning they are not necessarily compatible with all LLMs/APIs.

Example Usage: Below is a sample of code illustrating how to use the WhiteBoxUQ class to conduct hallucination detection.

from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model='gemini-2.5-pro')

from uqlm import WhiteBoxUQ
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])

results = await wbuq.generate_and_score(prompts=prompts)
results.to_df()

Again, any LangChain Chat Model may be used in place of ChatVertexAI. For more detailed examples, refer to our demo notebooks on Single-Generation White-Box UQ and/or Multi-Generation White-Box UQ.

Single-Generation Scorers (minimal latency, zero extra cost):

Minimum token probability (Manakul et al., 2023)
Length-Normalized Sequence Probability (Malinin & Gales, 2021)
Sequence Probability (Vashurin et al., 2024)
Mean Top-K Token Negentropy (Scalena et al., 2025; Manakul et al., 2023)
Min Top-K Token Negentropy (Scalena et al., 2025; Manakul et al., 2023)
Probability Margin (Farr et al., 2024)

Self-Reflection Scorers (one additional generation per response):

P(True) (Kadavath et al., 2022)

Multi-Generation Scorers (several additional generations per response):

Monte carlo sequence probability (Kuhn et al., 2023)
Consistency and Confidence (CoCoA) (Vashurin et al., 2025)
Semantic Entropy (Farquhar et al., 2024)
Semantic Density (Qiu et al., 2024)

LLM-as-a-Judge Scorers

These scorers use one or more LLMs to evaluate the reliability of the original LLM's response. They offer high customizability through prompt engineering and the choice of judge LLM(s).

Example Usage: Below is a sample of code illustrating how to use the LLMPanel class to conduct hallucination detection using a panel of LLM judges.

from langchain_ollama import ChatOllama
llama = ChatOllama(model="llama3")
mistral = ChatOllama(model="mistral")
qwen = ChatOllama(model="qwen3")

from uqlm import LLMPanel
panel = LLMPanel(llm=llama, judges=[llama, mistral, qwen])

results = await panel.generate_and_score(prompts=prompts)
results.to_df()

Note that although we use ChatOllama in this example, we can use any LangChain Chat Model as judges. For a more detailed demo illustrating how to customize a panel of LLM judges, refer to our LLM-as-a-Judge Demo.

Available Scorers:

Categorical LLM-as-a-Judge (Manakul et al., 2023; Chen & Mueller, 2023; Luo et al., 2023)
Continuous LLM-as-a-Judge (Xiong et al., 2024)
Likert Scale LLM-as-a-Judge (Bai et al., 2023)
Panel of LLM Judges (Verga et al., 2024)

Ensemble Scorers

These scorers leverage a weighted average of multiple individual scorers to provide a more robust uncertainty/confidence estimate. They offer high flexibility and customizability, allowing you to tailor the ensemble to specific use cases.

Example Usage: Below is a sample of code illustrating how to use the UQEnsemble class to conduct hallucination detection.

from langchain_openai import AzureChatOpenAI
llm = AzureChatOpenAI(deployment_name="gpt-4o", openai_api_type="azure", openai_api_version="2024-12-01-preview")

from uqlm import UQEnsemble
## ---Option 1: Off-the-Shelf Ensemble---
# uqe = UQEnsemble(llm=llm)
# results = await uqe.generate_and_score(prompts=prompts, num_responses=5)

## ---Option 2: Tuned Ensemble---
scorers = [ # specify which scorers to include
    "exact_match", "noncontradiction", # black-box scorers
    "min_probability", # white-box scorer
    llm # use same LLM as a judge
]
uqe = UQEnsemble(llm=llm, scorers=scorers)

# Tune on tuning prompts with provided ground truth answers
tune_results = await uqe.tune(
    prompts=tuning_prompts, ground_truth_answers=ground_truth_answers
)
# ensemble is now tuned - generate responses on new prompts
results = await uqe.generate_and_score(prompts=prompts)
results.to_df()

As with the other examples, any LangChain Chat Model may be used in place of AzureChatOpenAI. For more detailed demos, refer to our Off-the-Shelf Ensemble Demo (quick start) or our Ensemble Tuning Demo (advanced).

Available Scorers:

BS Detector (Chen & Mueller, 2023)
Generalized UQ Ensemble (Bouchard & Chauhan, 2025)

Documentation

Check out our documentation site for detailed instructions on using this package, including API reference and more.

Example notebooks

UQLM offers a broad collection of tutorial notebooks to demonstrate usage of the various scorers. These notebooks aim to have versatile coverage of various LLMs and datasets, but you can easily replace them with your LLM and dataset of choice. Below is a list of these tutorials:

Black-Box Uncertainty Quantification: A notebook demonstrating hallucination detection with black-box (consistency) scorers.
White-Box Uncertainty Quantification (Single-Generation): A notebook demonstrating hallucination detection with white-box (token probability-based) scorers requiring only a single generation per response (fastest and cheapest).
White-Box Uncertainty Quantification (Multi-Generation): A notebook demonstrating hallucination detection with white-box (token probability-based) scorers requiring multiple generations per response (slower and more expensive, but higher performance).
LLM-as-a-Judge: A notebook demonstrating hallucination detection with LLM-as-a-Judge.
Tunable UQ Ensemble: A notebook demonstrating hallucination detection with a tunable ensemble of UQ scorers (Bouchard & Chauhan, 2025).
Off-the-Shelf UQ Ensemble: A notebook demonstrating hallucination detection using BS Detector (Chen & Mueller, 2023) off-the-shelf ensemble.
Semantic Entropy: A notebook demonstrating token-probability-based semantic entropy (Farquhar et al., 2024; Kuhn et al., 2023), a state-of-the-art multi-generation white-box scorer.
Semantic Density: A notebook demonstrating semantic density Semantic Density (Qiu et al., 2024)), a state-of-the-art multi-generation white-box scorer.
Multimodal Uncertainty Quantification: A notebook demonstrating UQLM's scoring approach with multimodal inputs (compatible with black-box UQ and white-box UQ).
Score Calibration: A notebook illustrating transformation of confidence scores into calibrated probabilities that better reflect the true likelihood of correctness.

Citation

A technical description of the uqlm scorers and extensive experimental results are presented in this paper. If you use our framework or toolkit, please cite:

@misc{bouchard2025uncertaintyquantificationlanguagemodels,
      title={Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers}, 
      author={Dylan Bouchard and Mohit Singh Chauhan},
      year={2025},
      eprint={2504.19254},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.19254}, 
}

The uqlm software package is described in this this paper. If you use the software, please cite:

@misc{bouchard2025uqlmpythonpackageuncertainty,
      title={UQLM: A Python Package for Uncertainty Quantification in Large Language Models}, 
      author={Dylan Bouchard and Mohit Singh Chauhan and David Skarbrevik and Ho-Kyeong Ra and Viren Bajaj and Zeya Ahmad},
      year={2025},
      eprint={2507.06196},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.06196}, 
}

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.5.10

Apr 15, 2026

0.5.9

Apr 10, 2026

0.5.8

Mar 30, 2026

0.5.7

Mar 13, 2026

0.5.6

Mar 2, 2026

0.5.5

Feb 25, 2026

0.5.5a0 pre-release

Feb 25, 2026

0.5.4

Jan 30, 2026

0.5.3

Jan 20, 2026

0.5.2

Jan 14, 2026

0.5.1

Jan 9, 2026

0.5.0

Jan 8, 2026

0.5.0a1 pre-release

Jan 7, 2026

0.5.0a0 pre-release

Jan 7, 2026

0.4.5

Dec 8, 2025

0.4.4

Dec 4, 2025

This version

0.4.3

Dec 3, 2025

0.4.2

Nov 21, 2025

0.4.1

Nov 14, 2025

0.4.1a0 pre-release

Nov 13, 2025

0.4.0

Nov 12, 2025

0.4.0b1 pre-release

Nov 12, 2025

0.4.0b0 pre-release

Nov 12, 2025

0.4.0a3 pre-release

Nov 10, 2025

0.4.0a2 pre-release

Nov 10, 2025

0.4.0a1 pre-release

Nov 10, 2025

0.4.0a0 pre-release

Nov 10, 2025

0.3.1

Oct 23, 2025

0.3.0

Oct 1, 2025

0.2.7

Sep 12, 2025

0.2.6

Aug 28, 2025

0.2.5

Aug 25, 2025

0.2.4

Aug 13, 2025

0.2.3

Aug 5, 2025

0.2.2

Aug 1, 2025

0.2.1

Jul 28, 2025

0.2.0

Jul 25, 2025

0.2.0rc0 pre-release

Jul 25, 2025

0.2.0b4 pre-release

Jul 23, 2025

0.2.0b3 pre-release

Jul 22, 2025

0.2.0b2 pre-release

Jul 21, 2025

0.2.0b1 pre-release

Jul 21, 2025

0.2.0b0 pre-release

Jul 21, 2025

0.1.9

Jul 22, 2025

0.1.8

Jul 2, 2025

0.1.7

Jun 25, 2025

0.1.6

Jun 19, 2025

0.1.5

Jun 18, 2025

0.1.4

Jun 11, 2025

0.1.3

Jun 2, 2025

0.1.2

May 14, 2025

0.1.1

May 12, 2025

0.1.0

May 6, 2025

0.1.0rc0 pre-release

May 5, 2025

0.1.0b7 pre-release

May 5, 2025

0.1.0b6 pre-release yanked

May 5, 2025

0.1.0b5 pre-release yanked

May 5, 2025

0.1.0b4 pre-release yanked

May 5, 2025

0.1.0b3 pre-release yanked

May 5, 2025

0.1.0b2 pre-release yanked

May 5, 2025

0.1.0b1 pre-release yanked

May 5, 2025

0.1.0b0 pre-release yanked

May 4, 2025

0.1.0a1 pre-release yanked

May 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

uqlm-0.4.3.tar.gz (67.3 kB view details)

Uploaded Dec 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

uqlm-0.4.3-py3-none-any.whl (101.2 kB view details)

Uploaded Dec 3, 2025 Python 3

File details

Details for the file uqlm-0.4.3.tar.gz.

File metadata

Download URL: uqlm-0.4.3.tar.gz
Upload date: Dec 3, 2025
Size: 67.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.4 CPython/3.10.15 Linux/5.10.0-36-cloud-amd64

File hashes

Hashes for uqlm-0.4.3.tar.gz
Algorithm	Hash digest
SHA256	`fcf5c5965e9ee80525cf5c643fdca36fb199323f72c1f90fe6faa8cf04c8b4a5`
MD5	`0d0f13012ef2a2de1835e6db8b990aa2`
BLAKE2b-256	`81106df966d93bc774b4736dd02bb3ef339eea8308d26b53caa75a567615e1a5`

See more details on using hashes here.

File details

Details for the file uqlm-0.4.3-py3-none-any.whl.

File metadata

Download URL: uqlm-0.4.3-py3-none-any.whl
Upload date: Dec 3, 2025
Size: 101.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.4 CPython/3.10.15 Linux/5.10.0-36-cloud-amd64

File hashes

Hashes for uqlm-0.4.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cb66cccb731c3a891482e9ae5b1b273187a670166267bade0bb8ae1226e0af09`
MD5	`e22a77c4ffb5e44975acaeadba13adf1`
BLAKE2b-256	`29cd88f02bb742cce88d383040b774f5b59542e94bad520a37808f1e55303280`

See more details on using hashes here.

uqlm 0.4.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

uqlm: Uncertainty Quantification for Language Models

Installation

Hallucination Detection

Black-Box Scorers (Consistency-Based)

White-Box Scorers (Token-Probability-Based)

LLM-as-a-Judge Scorers

Ensemble Scorers

Documentation

Example notebooks

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes