
Uncertainty Estimation Toolkit for Transformer Language Models

Project description

License: MIT | Python 3.10

LM-Polygraph: Uncertainty estimation for LLMs

Installation | Basic usage | Overview | Benchmark | Demo application | Documentation

LM-Polygraph provides a battery of state-of-the-art uncertainty estimation (UE) methods for LLMs in text generation tasks. High uncertainty can indicate the presence of hallucinations, and a score that estimates uncertainty can help make applications of LLMs safer.

The framework also introduces an extendable benchmark for consistent evaluation of UE techniques by researchers and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses.

Installation

From GitHub

To install the latest version from the main branch, clone the repository and install it with pip. Using a virtual environment is recommended:

git clone https://github.com/IINemo/lm-polygraph.git
python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
cd lm-polygraph
pip install .

Installation from GitHub is recommended if you want to explore the example notebooks or use the default benchmarking configurations, as they are included in the repository but not in the PyPI package. However, code from the main branch may be unstable, so it is recommended to check out the latest stable release before installation:

git clone https://github.com/IINemo/lm-polygraph.git
cd lm-polygraph
git checkout tags/v0.3.0
python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
pip install .

From PyPI

To install the latest stable version from PyPI, run:

python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
pip install lm-polygraph

To install a specific version, run:

python3 -m venv env # Substitute this with your virtual environment creation command
source env/bin/activate
pip install lm-polygraph==0.3.0
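
Either way, a quick import check can confirm that the package is installed correctly (this one-liner is just a sanity check, not an official step from the project):

python3 -c "import lm_polygraph; print('lm-polygraph is installed')"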

Basic usage

  1. Initialize the base model (encoder-decoder or decoder-only) and tokenizer from HuggingFace or a local file, and use them to initialize the WhiteboxModel for evaluation. For example, with bigscience/bloomz-560m:
from transformers import AutoModelForCausalLM, AutoTokenizer
from lm_polygraph.utils.model import WhiteboxModel

model_path = "bigscience/bloomz-560m"
base_model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = WhiteboxModel(base_model, tokenizer, model_path=model_path)

Alternatively, you can use the WhiteboxModel.from_pretrained method to let LM-Polygraph download the model and tokenizer for you. However, this approach is deprecated and will be removed in the next major release.

from lm_polygraph.utils.model import WhiteboxModel

model = WhiteboxModel.from_pretrained(
    "bigscience/bloomz-3b",
    device_map="cuda:0",
)
  2. Specify the UE method:
from lm_polygraph.estimators import *

ue_method = MeanPointwiseMutualInformation()
  3. Get predictions and their uncertainty scores:
from lm_polygraph.utils.manager import estimate_uncertainty

input_text = "Who is George Bush?"
ue = estimate_uncertainty(model, ue_method, input_text=input_text)
print(ue)
# UncertaintyOutput(uncertainty=-6.504108926902215, input_text='Who is George Bush?', generation_text=' President of the United States', model_path='bigscience/bloomz-560m')
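
The same call works in a loop if you want to score several prompts at once. The sketch below reuses only the model, ue_method, and estimate_uncertainty objects defined above; the example questions are arbitrary:

# Score multiple prompts with the estimator configured above.
questions = [
    "Who is George Bush?",
    "What is the capital of France?",
]
for question in questions:
    ue = estimate_uncertainty(model, ue_method, input_text=question)
    # Higher uncertainty values suggest a less reliable generation.
    print(question, ue.uncertainty, ue.generation_text)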

Other examples:

  • example.ipynb: simple examples of scoring individual queries;
  • claim_level_example.ipynb: an example of scoring individual claims;
  • qa_example.ipynb: an example of scoring the bigscience/bloomz-3b model on the TriviaQA dataset;
  • mt_example.ipynb: an example of scoring the facebook/wmt19-en-de model on the WMT14 En-De dataset;
  • ats_example.ipynb: an example of scoring the facebook/bart-large-cnn model on the XSUM summarization dataset;
  • colab: demo web application in Colab (bloomz-560m, gpt-3.5-turbo, and gpt-4 fit the default memory limit; other models require Colab-pro).

Overview of methods

| Uncertainty Estimation Method | Type | Category | Compute | Memory | Need Training Data? | Level |
|---|---|---|---|---|---|---|
| Maximum sequence probability | White-box | Information-based | Low | Low | No | sequence/claim |
| Perplexity (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No | sequence/claim |
| Mean/max token entropy (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No | sequence/claim |
| Monte Carlo sequence entropy (Kuhn et al., 2023) | White-box | Information-based | High | Low | No | sequence |
| Pointwise mutual information (PMI) (Takayama and Arase, 2019) | White-box | Information-based | Medium | Low | No | sequence/claim |
| Conditional PMI (van der Poel et al., 2022) | White-box | Information-based | Medium | Medium | No | sequence |
| Rényi divergence (Darrin et al., 2023) | White-box | Information-based | Low | Low | No | sequence |
| Fisher-Rao distance (Darrin et al., 2023) | White-box | Information-based | Low | Low | No | sequence |
| Semantic entropy (Kuhn et al., 2023) | White-box | Meaning diversity | High | Low | No | sequence |
| Claim-Conditioned Probability (Fadeeva et al., 2024) | White-box | Meaning diversity | Low | Low | No | sequence/claim |
| TokenSAR (Duan et al., 2023) | White-box | Meaning diversity | High | Low | No | sequence |
| SentenceSAR (Duan et al., 2023) | White-box | Meaning diversity | High | Low | No | sequence |
| SAR (Duan et al., 2023) | White-box | Meaning diversity | High | Low | No | sequence |
| Sentence-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes | sequence |
| Token-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes | sequence |
| Mahalanobis distance (MD) (Lee et al., 2018) | White-box | Density-based | Low | Low | Yes | sequence |
| Robust density estimation (RDE) (Yoo et al., 2022) | White-box | Density-based | Low | Low | Yes | sequence |
| Relative Mahalanobis distance (RMD) (Ren et al., 2023) | White-box | Density-based | Low | Low | Yes | sequence |
| Hybrid Uncertainty Quantification (HUQ) (Vazhentsev et al., 2023a) | White-box | Density-based | Low | Low | Yes | sequence |
| p(True) (Kadavath et al., 2022) | White-box | Reflexive | Medium | Low | No | sequence/claim |
| Number of semantic sets (NumSets) (Lin et al., 2023) | Black-box | Meaning diversity | High | Low | No | sequence |
| Sum of eigenvalues of the graph Laplacian (EigV) (Lin et al., 2023) | Black-box | Meaning diversity | High | Low | No | sequence |
| Degree matrix (Deg) (Lin et al., 2023) | Black-box | Meaning diversity | High | Low | No | sequence |
| Eccentricity (Ecc) (Lin et al., 2023) | Black-box | Meaning diversity | High | Low | No | sequence |
| Lexical similarity (LexSim) (Fomicheva et al., 2020a) | Black-box | Meaning diversity | High | Low | No | sequence |
| Verbalized Uncertainty 1S (Tian et al., 2023) | Black-box | Reflexive | Low | Low | No | sequence |
| Verbalized Uncertainty 2S (Tian et al., 2023) | Black-box | Reflexive | Medium | Low | No | sequence |
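
As a rough illustration of how this table maps to code, any of the estimators above can be passed to the same estimate_uncertainty call from the Basic usage section. This is a minimal sketch reusing the model created earlier; the class names MaximumSequenceProbability and SemanticEntropy are assumptions about what lm_polygraph.estimators exports and may differ in your installed version:

from lm_polygraph.estimators import *  # exposes the estimator classes
from lm_polygraph.utils.manager import estimate_uncertainty

# Assumed estimator class names; check lm_polygraph.estimators for the exact ones.
for estimator in [MaximumSequenceProbability(), SemanticEntropy()]:
    ue = estimate_uncertainty(model, estimator, input_text="Who is George Bush?")
    print(type(estimator).__name__, ue.uncertainty)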

Benchmark

To evaluate the performance of uncertainty estimation methods, consider this quick example:

HYDRA_CONFIG=../examples/configs/polygraph_eval_coqa.yaml python ./scripts/polygraph_eval \
    dataset="coqa" \
    model.path="databricks/dolly-v2-3b" \
    save_path="./workdir/output" \
    "seed=[1,2,3,4,5]"

Use visualization_tables.ipynb or result_tables.ipynb to generate summary tables for an experiment.

A detailed description of the benchmark is in the documentation.

Demo web application


Start with Docker

docker run -p 3001:3001 -it \
    -v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
    --gpus all mephodybro/polygraph_demo:0.0.17 polygraph_server

The server should be available at http://localhost:3001.

A more detailed description of the demo is available in the documentation.

Cite

@inproceedings{fadeeva-etal-2023-lm,
    title = "{LM}-Polygraph: Uncertainty Estimation for Language Models",
    author = "Fadeeva, Ekaterina  and
      Vashurin, Roman  and
      Tsvigun, Akim  and
      Vazhentsev, Artem  and
      Petrakov, Sergey  and
      Fedyanin, Kirill  and
      Vasilev, Daniil  and
      Goncharova, Elizaveta  and
      Panchenko, Alexander  and
      Panov, Maxim  and
      Baldwin, Timothy  and
      Shelmanov, Artem",
    editor = "Feng, Yansong  and
      Lefever, Els",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-demo.41",
    doi = "10.18653/v1/2023.emnlp-demo.41",
    pages = "446--461",
    abstract = "Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often {``}hallucinate{''}, i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.",
}

Acknowledgements

The chat GUI implementation is based on the chatgpt-web-application project.
