A universal metric for Generative Large Language Models (GLLMs)
ANLS ★
🌟 A Universal Metric for Generative Large Language Models 🌟
@misc{anls_star,
title={ANLS* -- A Universal Document Processing Metric for Generative Large Language Models},
author={David Peer and Philemon Schöpf and Volckmar Nebendahl and Alexander Rietzler and Sebastian Stabinger},
year={2024},
eprint={2402.03848},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
How to use the ANLS* score?
- Install the package
pip install anls_star
- Add to your code
from anls_star import anls_score
anls = anls_score("Hello World", "Hello Wrld")
print(anls)
- That's it!
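As a rough illustration of what a string-versus-string score is based on, the following sketch computes a normalized Levenshtein similarity in plain Python. The helper names `levenshtein` and `normalized_similarity` are hypothetical and not part of `anls_star`; any text normalization or thresholding the library applies is left out, so its output may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop a character of a
                current[j - 1] + 1,      # insert a character of b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(gt: str, pred: str) -> float:
    """1 - distance / longest length; 1.0 means the strings are identical."""
    longest = max(len(gt), len(pred))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(gt, pred) / longest


# One missing character out of eleven: 1 - 1/11
print(round(normalized_similarity("Hello World", "Hello Wrld"), 3))  # 0.909
```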
Supported Types
Simply copy this file to your project and import the anls_score
function from it. Then call the function with the ground truth and the predictions.
The following types (and all combinations of them) are supported:
String
: To compare strings against each other using the normalized Levenshtein similarity.
None
: Sometimes questions are not answerable. With this type, it can be checked whether the model abstains from answering. Any answer other than None will be penalized.
Tuple
: Compare the given answer with each element in the tuple and select the element that produces the maximum ANLS* score. This is also provided by the classical ANLS metric.
List
: Sometimes it is required to extract information in the form of lists from a document, for example all purchased items found in an invoice. While the order is not important, the list should contain all items. Note that the same item can occur multiple times in a list. Hungarian matching is used to compare the ground truth and the predicted list against each other. Both missing and hallucinated elements are penalized.
Dict
: For document information extraction it is usually required to extract key-value pairs, for example the date and total value from an invoice. Missing keys as well as hallucinated keys are penalized.
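The type rules above can be sketched as a small recursive function. This is a simplified illustration, not the `anls_star` implementation: the names `score`, `_edit_distance` and `_string_score` are hypothetical, the optimal list assignment is found by brute force over permutations (same result as Hungarian matching, but only practical for short lists), and scores are averaged at each nesting level, which is an assumption that may differ from how the library aggregates nested structures.

```python
from itertools import permutations


def _edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def _string_score(gt, pred):
    longest = max(len(gt), len(pred))
    return 1.0 if longest == 0 else 1.0 - _edit_distance(gt, pred) / longest


def score(gt, pred):
    """Similarity in [0, 1] between a ground truth and a prediction."""
    if gt is None:  # unanswerable: only None is a correct prediction
        return 1.0 if pred is None else 0.0
    if isinstance(gt, tuple):  # several valid answers: take the best one
        return max(score(option, pred) for option in gt)
    if isinstance(gt, str):
        return _string_score(gt, pred) if isinstance(pred, str) else 0.0
    if isinstance(gt, dict):
        if not isinstance(pred, dict):
            return 0.0
        keys = set(gt) | set(pred)
        if not keys:
            return 1.0
        # Keys present on only one side (missing or hallucinated) add 0.
        total = sum(score(gt[k], pred[k]) for k in keys if k in gt and k in pred)
        return total / len(keys)
    if isinstance(gt, list):
        if not isinstance(pred, list):
            return 0.0
        size = max(len(gt), len(pred))
        if size == 0:
            return 1.0
        # Square score matrix, padded with zeros for unmatched elements.
        matrix = [[score(g, p) for p in pred] + [0.0] * (size - len(pred)) for g in gt]
        matrix += [[0.0] * size for _ in range(size - len(gt))]
        best = max(
            sum(matrix[i][j] for i, j in enumerate(perm))
            for perm in permutations(range(size))
        )
        return best / size
    raise TypeError(f"unsupported ground truth type: {type(gt).__name__}")


gt = {"date": "2024-01-31", "items": ["pen", "ink"]}
pred = {"date": "2024-01-31", "items": ["ink", "pen"], "vat": "20%"}
# Two keys match perfectly, one key is hallucinated: 2 / 3
print(round(score(gt, pred), 3))  # 0.667
```

Normalizing by the longer list and by the union of keys is what makes both missing and hallucinated entries lower the score in this sketch.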
Benchmarks
The following table shows the ANLS* score for the different models and prompt methods on different datasets. To reduce execution time and costs, each model and prompt method is evaluated on 100 samples for single page datasets and 20 samples for multi page datasets. The provided validation set of each dataset is used for the report.
Dataset | Method | gpt-3.5-turbo-16k | gpt-4-turbo | gpt-4-vision | gemini-pro | mistral-large | claude-3 |
---|---|---|---|---|---|---|---|
DocVQA | Simple | 0.586 | 0.607 | 0.759 | 0.586 | 0.445 | 0.768 |
DocVQA | Latin Prompting | 0.659 | 0.699 | - | 0.676 | 0.447 | 0.762 |
DocVQA | SFT (Ours) | 0.809 | 0.790 | - | 0.741 | 0.648 | 0.831 |
MPDocVQA | Simple | 0.517 | 0.635 | 0.708 | 0.603 | 0.364 | 0.636 |
MPDocVQA | Latin Prompting | 0.499 | 0.739 | - | 0.502 | 0.335 | 0.438 |
MPDocVQA | SFT (Ours) | 0.734 | 0.781 | - | 0.616 | 0.476 | 0.575 |
Kleister Charity | Simple | 0.490 | 0.743 | 0.751 | 0.583 | 0.652 | 0.800 |
Kleister Charity | Latin Prompting | 0.442 | 0.735 | - | 0.478 | 0.576 | 0.787 |
Kleister Charity | SFT (Ours) | 0.476 | 0.763 | - | 0.633 | 0.657 | 0.786 |
Kleister NDA | Simple | 0.343 | 0.695 | 0.664 | 0.623 | 0.637 | 0.673 |
Kleister NDA | Latin Prompting | 0.434 | 0.705 | - | 0.599 | 0.624 | 0.67 |
Kleister NDA | SFT (Ours) | 0.355 | 0.703 | - | 0.552 | 0.641 | 0.677 |
SROIE | Simple | 0.874 | 0.835 | 0.834 | 0.263 | 0.855 | 0.933 |
SROIE | Latin Prompting | 0.849 | 0.851 | - | 0.371 | 0.863 | 0.926 |
SROIE | SFT (Ours) | 0.893 | 0.873 | - | 0.288 | 0.905 | 0.949 |
VRDU AD Buy | Simple | 0.402 | 0.553 | 0.640 | 0.510 | 0.386 | 0.577 |
VRDU AD Buy | Latin Prompting | 0.389 | 0.586 | - | 0.556 | 0.435 | 0.608 |
VRDU AD Buy | SFT (Ours) | 0.661 | 0.770 | - | 0.685 | 0.594 | 0.633 |
VRDU Registration | Simple | 0.659 | 0.676 | 0.665 | 0.699 | 0.579 | 0.685 |
VRDU Registration | Latin Prompting | 0.693 | 0.673 | - | 0.740 | 0.587 | 0.715 |
VRDU Registration | SFT (Ours) | 0.723 | 0.711 | - | 0.720 | 0.639 | 0.705 |
How To Execute
- Install all dependencies via
pip install -r requirements_dev.txt
- Set up the keys
  - OpenAI: Ensure that your OpenAI API key is set as the environment variable OPENAI_API_KEY.
  - Gemini: Ensure that your VertexAI setup is correct in case you want to benchmark gemini-pro too.
  - Mistral: Set the MISTRAL_API_KEY environment variable as well as MISTRAL_ENDPOINT (Azure).
  - Anthropic: Set the ANTHROPIC_API_KEY environment variable.
- Download all datasets. The download link is provided when executing the benchmark script for the first time. Please note that the datasets folder should be on the same level as the repository folder.
- Execute the corresponding benchmark script. For example:
python3 src/benchmark_doc_vqa.py "gpt-3.5-turbo-16k" "simple"
The following models are benchmarked:
  - gpt-3.5-turbo-16k (Version gpt-3.5-turbo-16k-0613)
  - gpt-4-turbo (Version gpt-4-1106-preview)
  - gemini-pro (Version 1.0)
  - mistral-large (Version 03/2024)
  - claude-3 (Version claude-3-opus-20240229)
  - gpt-4-vision-preview (Version gpt-4-1106-vision-preview)
The following prompt methods are supported:
  - simple: Simple text concatenation after OCR with Google OCR
  - latin: Method as introduced by Wang et al.
  - sft: DeepOpinion internal only
  - vision: If images should directly be used. Requires a model with vision capabilities, e.g. gpt-4-vision
- The final ANLS* is shown on the console.
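Before starting a benchmark run, it can help to check that the keys listed above are set. The following is a minimal sketch and not part of the repository: the names `REQUIRED_ENV`, `missing_env` and `build_command` are hypothetical, and the grouping of variables by provider is an assumption based only on the setup steps above (the VertexAI setup for Gemini is not checked).

```python
import os

# Environment variables named in the setup steps, grouped by provider.
# The grouping itself is an assumption made for this sketch.
REQUIRED_ENV = {
    "OpenAI": ["OPENAI_API_KEY"],
    "Mistral": ["MISTRAL_API_KEY", "MISTRAL_ENDPOINT"],
    "Anthropic": ["ANTHROPIC_API_KEY"],
}


def missing_env(provider, environ=None):
    """Return the required variables that are unset or empty for a provider."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV.get(provider, []) if not environ.get(name)]


def build_command(script, model, method):
    """Build the argument list for one benchmark run, as in the example above."""
    return ["python3", script, model, method]


if __name__ == "__main__":
    for provider in REQUIRED_ENV:
        missing = missing_env(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
    print(" ".join(build_command("src/benchmark_doc_vqa.py", "gpt-3.5-turbo-16k", "simple")))
```

The command list returned by `build_command` can be passed to `subprocess.run` if the runs should be started from Python instead of the shell.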
How to Execute all Unit Tests
To run all unit tests simply execute pytest
Packaging
See https://packaging.python.org/en/latest/tutorials/packaging-projects/