LLMSanitize: a package to detect contamination in LLMs
An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).
Installation
The library has been designed and tested with Python 3.9 and CUDA 11.8.
First make sure you have CUDA 11.8 installed, and create a conda environment with Python 3.9:
conda create --name llmsanitize python=3.9
Next activate the environment:
conda activate llmsanitize
Then install LLMSanitize from PyPI:
pip install llmsanitize
Notably, we use vllm 0.3.3.
Supported Methods
The repository supports the following contamination detection methods:
Method | Use Case | Method Type | Model Access | Reference |
---|---|---|---|---|
gpt-2 | Open-data | String Matching | _ | Language Models are Unsupervised Multitask Learners (link), Section 4 |
gpt-3 | Open-data | String Matching | _ | Language Models are Few-Shot Learners (link), Section 4 |
exact | Open-data | String Matching | _ | Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (link), Section 4.2 |
palm | Open-data | String Matching | _ | PaLM: Scaling Language Modeling with Pathways (link), Sections 7-8 |
gpt-4 | Open-data | String Matching | _ | GPT-4 Technical Report (link), Appendix C |
platypus | Open-data | Embeddings Similarity | _ | Platypus: Quick, Cheap, and Powerful Refinement of LLMs (link), Section 2.3 |
guided-prompting | Closed-data | Prompt Engineering/LLM-based | Black-box | Time Travel in LLMs: Tracing Data Contamination in Large Language Models (link) |
sharded-likelihood | Closed-data | Model Likelihood | White-box | Proving Test Set Contamination in Black-box Language Models (link) |
min-prob | Closed-data | Model Likelihood | White-box | Detecting Pretraining Data from Large Language Models (link) |
cdd | Closed-data | Model Memorization/Model Likelihood | Black-box | Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models (link), Section 3.2 |
ts-guessing-question-based | Closed-data | Model Completion | Black-box | Investigating Data Contamination in Modern Benchmarks for Large Language Models (link), Section 3.2.1 |
ts-guessing-question-multichoice | Closed-data | Model Completion | Black-box | Investigating Data Contamination in Modern Benchmarks for Large Language Models (link), Section 3.2.2 |
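The string-matching methods above flag a test sample as contaminated when enough of its n-grams also appear in the (open) training data; for example, GPT-3's procedure checks for 13-gram collisions. The following is a minimal, illustrative sketch of that idea, not the library's actual implementation (function names and the small `n` are ours, chosen for readability):

```python
def ngrams(text, n):
    """Return the set of word-level n-grams of a text (lowercased)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(sample, corpus, n=8):
    """Fraction of the sample's n-grams that also occur in the corpus.

    String-matching contamination detectors flag the sample when this
    fraction exceeds some threshold. Real pipelines use larger n
    (e.g., 13 for GPT-3) and scan the full pretraining corpus.
    """
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    corpus_grams = ngrams(corpus, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)
```

In practice the corpus side is far too large to hold in one set, so implementations typically hash n-grams or use Bloom filters, but the decision rule is the same.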
vLLM
The following methods require launching a vLLM instance, which handles model inference:
Method |
---|
guided-prompting |
min-prob |
cdd |
ts-guessing-question-based |
ts-guessing-question-multichoice |
To launch the instance, first run the following command in a terminal:
sh llmsanitize/scripts/vllm_hosting.sh
Specify the port number and model name in this shell script before running it.
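Once the instance is up, vLLM serves an OpenAI-compatible HTTP API. As a sanity check that the server is reachable, you can send a small completion request; the sketch below only builds the JSON payload (the port and model name are placeholders and must match what you configured in the hosting script):

```python
import json

# Assumptions for illustration: adjust to match your vllm_hosting.sh settings.
PORT = 8000                          # hypothetical port number
MODEL = "meta-llama/Llama-2-7b-hf"   # hypothetical model name

def build_completion_request(prompt, max_tokens=64):
    """Payload for vLLM's OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output for detection methods
    }

url = f"http://localhost:{PORT}/v1/completions"
payload = build_completion_request("The capital of France is")
print(url)
print(json.dumps(payload))
```

To actually send it, POST the payload to `url` with any HTTP client (e.g., `requests.post(url, json=payload)`).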
Run Contamination Detection
To run contamination detection, use the test scripts in the scripts/tests/ folder.
For instance, to run sharded-likelihood on Hellaswag with Llama-2-7B:
sh llmsanitize/scripts/tests/closed_data/sharded-likelihood/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder>
To run a method that uses vLLM, such as guided-prompting, the only difference is that you also pass the port number as an argument:
sh llmsanitize/scripts/tests/closed_data/guided-prompting/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder> -p <port_number_from_your_vllm_instance>
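As an illustration of what a model-likelihood method such as min-prob computes: Min-K% Prob averages the log-probabilities of the k% least likely tokens in a sample, and an abnormally high score suggests the sample was seen during pretraining. The toy sketch below operates on synthetic per-token log-probabilities and is not the library's implementation:

```python
def min_k_prob(token_logprobs, k=0.2):
    """Average log-probability of the k% least likely tokens.

    token_logprobs: per-token log-probabilities from the model under test.
    Memorized (contaminated) text tends to have few surprising tokens,
    so its Min-K% score is higher (closer to 0) than unseen text's.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

In the real method, the per-token log-probabilities come from the evaluated LLM, and the score is thresholded (or ranked across samples) to decide contamination.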
Citation
If you find our paper or this project helpful to your research, please consider citing our paper in your publication.
@article{ravaut2024much,
title={How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library},
author={Ravaut, Mathieu and Ding, Bosheng and Jiao, Fangkai and Chen, Hailin and Li, Xingxuan and Zhao, Ruochen and Qin, Chengwei and Xiong, Caiming and Joty, Shafiq},
journal={arXiv preprint arXiv:2404.00699},
year={2024}
}
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file llmsanitize-0.0.4.tar.gz
File metadata
- Download URL: llmsanitize-0.0.4.tar.gz
- Upload date:
- Size: 31.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.18
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 47a3d2ebe91beda64300f601de4ac51dce5ba8cc7feee46a4ee1ea0c03dcdb30 |
MD5 | d5badea565224b8ff067f83683e08ea6 |
BLAKE2b-256 | e089f05d17715c50902f042bdf095ab7283b3e43aa48b1c2ca315c20e1ad9c15 |
File details
Details for the file llmsanitize-0.0.4-py3-none-any.whl
File metadata
- Download URL: llmsanitize-0.0.4-py3-none-any.whl
- Upload date:
- Size: 43.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.18
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | aa75a798fcf3ce1ce11b2706873d1ad0ad7ef5edb200b6e6c5cda12232f8903f |
MD5 | bac102a230442cc9c4c6a796a6e2b215 |
BLAKE2b-256 | 9168bc5e11d6b58f58584b1d9a17bf1668c50e1136e3dcd3736bb68efd71e59d |