Uncertainty Estimation Toolkit for Transformer Language Models
LM-Polygraph: Uncertainty estimation for LLMs
Installation | Basic usage | Overview | Benchmark | Demo application | Documentation
LM-Polygraph provides a battery of state-of-the-art uncertainty estimation (UE) methods for LMs in text generation tasks. High uncertainty can indicate the presence of hallucinations, and a score that estimates uncertainty can help make LLM applications safer.
The framework also introduces an extendable benchmark for consistent evaluation of UE techniques by researchers and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses.
Installation
git clone https://github.com/IINemo/lm-polygraph.git && cd lm-polygraph && pip install .
Basic usage
- Initialize the model (encoder-decoder or decoder-only) from HuggingFace or a local file. For example, bigscience/bloomz-3b:
from lm_polygraph.utils.model import WhiteboxModel
model = WhiteboxModel.from_pretrained(
"bigscience/bloomz-3b",
device="cuda:0",
)
- Specify the UE method:
from lm_polygraph.estimators import MeanPointwiseMutualInformation
ue_method = MeanPointwiseMutualInformation()
- Get predictions and their uncertainty scores
from lm_polygraph.utils.manager import estimate_uncertainty
input_text = "Who is George Bush?"
estimate_uncertainty(model, ue_method, input_text=input_text)
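Raw uncertainty scores are on different scales for different methods, so an application usually needs a rule for deciding when a score is "high". The following is a minimal sketch of one such rule, assuming you have already collected scores for a set of validation prompts. The helper names `pick_threshold` and `is_unreliable` and the `keep_fraction` parameter are hypothetical and are not part of LM-Polygraph.

```python
import math


def pick_threshold(validation_scores, keep_fraction=0.9):
    """Return the score below which roughly `keep_fraction` of validation scores fall.

    Hypothetical helper: the threshold is an empirical quantile of scores
    that were collected beforehand on validation prompts.
    """
    if not validation_scores:
        raise ValueError("validation_scores must not be empty")
    ordered = sorted(validation_scores)
    index = max(0, math.ceil(keep_fraction * len(ordered)) - 1)
    return ordered[index]


def is_unreliable(score, threshold):
    """Flag a response whose uncertainty score exceeds the chosen threshold."""
    return score > threshold


# Illustrative numbers only, not real model outputs.
threshold = pick_threshold([4.0, 1.0, 3.0, 2.0], keep_fraction=0.5)
print(threshold, is_unreliable(3.0, threshold))
```

The threshold has to be recomputed for every UE method and model, since a value that is high for one estimator can be typical for another.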
Other examples:
- example.ipynb: simple examples of scoring individual queries;
- qa_example.ipynb: an example of scoring the bigscience/bloomz-3b model on the TriviaQA dataset;
- mt_example.ipynb: an example of scoring the facebook/wmt19-en-de model on the WMT14 En-De dataset;
- ats_example.ipynb: an example of scoring the facebook/bart-large-cnn model on the XSUM summarization dataset;
- colab: demo web application in Colab (bloomz-560m, gpt-3.5-turbo, and gpt-4 fit the default memory limit; other models require Colab-pro).
Overview of methods
Uncertainty Estimation Method | Type | Category | Compute | Memory | Need Training Data? |
---|---|---|---|---|---|
Maximum sequence probability | White-box | Information-based | Low | Low | No |
Perplexity (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No |
Mean token entropy (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No |
Monte Carlo sequence entropy (Kuhn et al., 2023) | White-box | Information-based | High | Low | No |
Pointwise mutual information (PMI) (Takayama and Arase, 2019) | White-box | Information-based | Medium | Low | No |
Conditional PMI (van der Poel et al., 2022) | White-box | Information-based | Medium | Medium | No |
Semantic entropy (Kuhn et al., 2023) | White-box | Meaning Diversity | High | Low | No |
Sentence-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes |
Token-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes |
Mahalanobis distance (MD) (Lee et al., 2018) | White-box | Density-based | Low | Low | Yes |
Robust density estimation (RDE) (Yoo et al., 2022) | White-box | Density-based | Low | Low | Yes |
Relative Mahalanobis distance (RMD) (Ren et al., 2023) | White-box | Density-based | Low | Low | Yes |
Hybrid Uncertainty Quantification (HUQ) (Vazhentsev et al., 2023a) | White-box | Density-based | Low | Low | Yes |
p(True) (Kadavath et al., 2022) | White-box | Reflexive | Medium | Low | No |
Number of semantic sets (NumSets) (Kuhn et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Sum of eigenvalues of the graph Laplacian (EigV) (Lin et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Degree matrix (Deg) (Lin et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Eccentricity (Ecc) (Lin et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Lexical similarity (LexSim) (Fomicheva et al., 2020a) | Black-box | Meaning Diversity | High | Low | No |
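To make the simplest information-based rows of the table concrete, here is a minimal sketch of three of them computed from token-level model outputs. It uses the textbook definitions and plain Python lists; the function names are hypothetical, and LM-Polygraph's own estimators may differ in details such as sign conventions, normalization, or whether the logarithm of perplexity is reported.

```python
import math


def max_sequence_probability_score(token_logprobs):
    """Negative log-probability of the generated sequence (higher = more uncertain)."""
    return -sum(token_logprobs)


def perplexity(token_logprobs):
    """Exponentiated negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) of the per-step next-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Toy inputs (assumed, not produced by a real model):
# log-probabilities of the two generated tokens ...
logprobs = [math.log(0.5), math.log(0.25)]
# ... and the full next-token distribution at each of the two steps.
distributions = [[0.5, 0.5], [1.0, 0.0]]

print(max_sequence_probability_score(logprobs))
print(perplexity(logprobs))
print(mean_token_entropy(distributions))
```

All three need only a single forward generation pass and access to token probabilities, which is consistent with their "White-box", low-compute classification in the table.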
Benchmark
To evaluate the performance of uncertainty estimation methods, consider a quick example:
HYDRA_CONFIG=../configs/polygraph_eval/polygraph_eval.yaml python ./scripts/polygraph_eval \
dataset="./workdir/data/triviaqa.csv" \
model="databricks/dolly-v2-3b" \
save_path="./workdir/output" \
seed=[1,2,3,4,5]
Use visualization_tables.ipynb to generate the summarizing tables for an experiment.
A detailed description of the benchmark is in the documentation.
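One common way to judge a UE method is to check whether discarding the most uncertain generations raises the average quality of what remains. The sketch below illustrates that idea with a simple rejection curve. It is an assumption about how such an evaluation can be done and is not necessarily the metric that the benchmark script reports; `rejection_curve` and its parameters are hypothetical names.

```python
def rejection_curve(uncertainties, qualities, fractions=(0.0, 0.25, 0.5, 0.75)):
    """Mean quality of the retained generations at several rejection rates.

    uncertainties: one UE score per generation (higher = less trusted).
    qualities: one quality value per generation (e.g. 1.0 if correct, else 0.0).
    fractions: shares of the most uncertain generations to discard.
    """
    if len(uncertainties) != len(qualities):
        raise ValueError("uncertainties and qualities must have the same length")
    # Indices sorted from most confident to least confident.
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    curve = []
    for fraction in fractions:
        n_kept = max(1, round(len(order) * (1 - fraction)))
        kept = order[:n_kept]
        curve.append(sum(qualities[i] for i in kept) / len(kept))
    return curve


# Toy data (assumed): the two incorrect answers received the highest uncertainty.
print(rejection_curve([0.1, 0.9, 0.4, 0.7], [1.0, 0.0, 1.0, 0.0]))
```

A curve that rises as the rejection rate grows suggests that the uncertainty scores rank unreliable generations above reliable ones; a flat curve suggests the scores carry little information about quality.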
Demo web application
Start with Docker
docker run -p 3001:3001 -it \
-v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
--gpus all mephodybro/polygraph_demo:0.0.17 polygraph_server
The server should be available at http://localhost:3001.
A more detailed description of the demo is available in the documentation.
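The container may take a while to start answering requests after it launches. As a convenience, the following sketch polls the address above until it responds, using only the Python standard library. The `wait_for_server` helper and its parameters are hypothetical and not part of LM-Polygraph; the `opener` argument exists only so the function can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, attempts=30, delay=2.0, opener=urllib.request.urlopen):
    """Return True once `url` answers an HTTP request, False if it never does.

    Hypothetical helper: retries `attempts` times, sleeping `delay` seconds
    between tries. HTTP error statuses and connection failures both count
    as "not ready yet".
    """
    for _ in range(attempts):
        try:
            with opener(url, timeout=5):
                return True
        except OSError:
            # URLError, HTTPError, refused connections, and timeouts
            # are all subclasses of OSError.
            pass
        time.sleep(delay)
    return False


# Example call (left commented out so that importing this file does not block):
# print(wait_for_server("http://localhost:3001"))
```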
Acknowledgements
The chat GUI implementation is based on the chatgpt-web-application project.
Hashes for lm_polygraph-0.0.1.dev0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 4ee0922f295acc023bf4e6f7e6b61fdd3cb806f98826786e5439d99c067fd1ea |
MD5 | 3b95b4f6c7f0846bf6dcb9bc2321d58f |
BLAKE2b-256 | c386d8a3c6c100e4b661a57d628e3024928ba25be9dfe0d59326866963002a83 |