A Toolkit for automatic LLM Evaluation.
TAIL
📄 Documentation
See our full documentation at https://yale-nlp.github.io/TAIL/.
💡 Introduction
TAIL is a toolkit for automatically creating realistic evaluation benchmarks and assessing the performance of long-context LLMs. With TAIL, users can build a customized long-context, document-grounded QA benchmark and obtain visualized performance metrics for the evaluated models.
🚀 Quickstart
- Install the package from PyPI:
# (Recommended) Create a new conda environment.
conda create -n tail python=3.10 -y
conda activate tail
# Install tailtest
pip install tailtest
Set your OPENAI_API_KEY:
export OPENAI_API_KEY="..."
- Prepare the source document you want to generate a benchmark from, and format it as JSON:
[{"text": "Content of your document"}]
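A minimal sketch of producing that file from Python, assuming the source text is already available as a string. The helper name `write_raw_document` and the output path `raw.json` are illustrative and not part of TAIL; only the `[{"text": ...}]` layout comes from the format shown above.

```python
import json
from pathlib import Path


def write_raw_document(text, path):
    """Write one source text as a JSON list holding a single {"text": ...} object.

    Hypothetical helper: it only mirrors the layout shown in this README.
    """
    path = Path(path)
    records = [{"text": text}]
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    out = write_raw_document("Content of your document", "raw.json")
    print(out.read_text(encoding="utf-8"))
```

The resulting path is what `--raw_document_path` would point to.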
- Benchmark Generation:
tail-cli.build --raw_document_path "/data/raw.json" --QA_save_path "/data/QA.json" --document_length 8000 32000 64000 --depth_list 25 50 75
- Model Evaluation & Testing:
tail-cli.eval --QA_save_path "/data/QA.json" --test_model_name "gpt-4o" --test_depth_list 25 75 --test_doc_length 8000 32000 --test_result_save_dir "/data/result/"
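When the two commands are driven from a script, it can help to assemble their arguments in one place. The sketch below only builds the argument lists using the flags shown above; the function names `build_cmd`, `eval_cmd`, and `uncovered` are hypothetical. The check that tested depths and lengths are among the generated ones is an assumption drawn from the example values, not a documented TAIL requirement.

```python
import shlex


def build_cmd(raw_document_path, qa_save_path, document_lengths, depths):
    """Argument list for tail-cli.build, using only the flags shown in this README."""
    return [
        "tail-cli.build",
        "--raw_document_path", raw_document_path,
        "--QA_save_path", qa_save_path,
        "--document_length", *map(str, document_lengths),
        "--depth_list", *map(str, depths),
    ]


def eval_cmd(qa_save_path, model_name, depths, doc_lengths, result_dir):
    """Argument list for tail-cli.eval, using only the flags shown in this README."""
    return [
        "tail-cli.eval",
        "--QA_save_path", qa_save_path,
        "--test_model_name", model_name,
        "--test_depth_list", *map(str, depths),
        "--test_doc_length", *map(str, doc_lengths),
        "--test_result_save_dir", result_dir,
    ]


def uncovered(requested, available):
    """Values requested for testing that were not generated (assumed sanity check)."""
    return sorted(set(requested) - set(available))


if __name__ == "__main__":
    lengths, depths = [8000, 32000, 64000], [25, 50, 75]
    test_lengths, test_depths = [8000, 32000], [25, 75]

    # Assumption: evaluation settings should be a subset of the generated ones.
    missing = uncovered(test_lengths, lengths) + uncovered(test_depths, depths)
    if missing:
        print("Warning: settings not present in the generated benchmark:", missing)

    print(shlex.join(build_cmd("/data/raw.json", "/data/QA.json", lengths, depths)))
    print(shlex.join(eval_cmd("/data/QA.json", "gpt-4o", test_depths, test_lengths, "/data/result/")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the package is installed and `OPENAI_API_KEY` is set.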
File details
Details for the file tail_test-0.1.0.tar.gz.
File metadata
- Download URL: tail_test-0.1.0.tar.gz
- Upload date:
- Size: 9.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8c8e9889dfe2b5868a3217d5202fa481138c775afa914b2a6a55e25064ab5d97
MD5 | 7b042eb0f8d3999a5e348734a11dcf34
BLAKE2b-256 | d47117688269eec2e9f24cbb57c70d7e71a9b9f29dfe686177569566772808c8
File details
Details for the file tail_test-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tail_test-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 345b7d58641a3bfe330a456d22a321cf5d4bdd25fdd9febad13b0a76a8ff43b8
MD5 | 0f7a2553ec3ed3c5857d51178f550005
BLAKE2b-256 | 85817ae4d8f509b7c91dd83427dff53232118b14488c36cb9d50b59da35d3293
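A downloaded file can be checked against its listed SHA256 digest with the standard library. This is a minimal sketch: the helper names `file_sha256` and `matches` are illustrative, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the expected digest (case-insensitive)."""
    return file_sha256(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage: python check_hash.py <downloaded file> <expected sha256>
    file_path, expected = sys.argv[1], sys.argv[2]
    print("OK" if matches(file_path, expected) else "MISMATCH")
```

A mismatch means the file differs from the one described in the table and should not be installed.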