LLMSanitize: a package to detect contamination in LLMs

Project description

LLMSanitize

An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).

Installation

The library has been designed and tested with Python 3.9 and CUDA 11.8.

First make sure you have CUDA 11.8 installed, and create a conda environment with Python 3.9:

conda create --name llmsanitize python=3.9

Next activate the environment:

conda activate llmsanitize

Then install LLMSanitize from PyPI:

pip install llmsanitize

Notably, we use vllm 0.3.3.
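
To quickly verify the installation, you can check that the package's contamination checker imports cleanly (this is the same class used in the usage example further below):

# Should run without errors after installation.
from llmsanitize import ClosedDataContaminationChecker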

Supported Methods

The repository supports the following contamination detection methods:

| Method | Use Case | Method Type | Model Access | Reference |
|---|---|---|---|---|
| gpt-2 | Open-data | String Matching | _ | Language Models are Unsupervised Multitask Learners (link), Section 4 |
| gpt-3 | Open-data | String Matching | _ | Language Models are Few-Shot Learners (link), Section 4 |
| exact | Open-data | String Matching | _ | Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (link), Section 4.2 |
| palm | Open-data | String Matching | _ | PaLM: Scaling Language Modeling with Pathways (link), Sections 7-8 |
| gpt-4 | Open-data | String Matching | _ | GPT-4 Technical Report (link), Appendix C |
| platypus | Open-data | Embeddings Similarity | _ | Platypus: Quick, Cheap, and Powerful Refinement of LLMs (link), Section 2.3 |
| guided-prompting | Closed-data | Prompt Engineering/LLM-based | Black-box | Time Travel in LLMs: Tracing Data Contamination in Large Language Models (link) |
| sharded-likelihood | Closed-data | Model Likelihood | White-box | Proving Test Set Contamination in Black-box Language Models (link) |
| min-prob | Closed-data | Model Likelihood | White-box | Detecting Pretraining Data from Large Language Models (link) |
| cdd | Closed-data | Model Memorization/Model Likelihood | Black-box | Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models (link), Section 3.2 |
| ts-guessing-question-based | Closed-data | Model Completion | Black-box | Investigating Data Contamination in Modern Benchmarks for Large Language Models (link), Section 3.2.1 |
| ts-guessing-question-multichoice | Closed-data | Model Completion | Black-box | Investigating Data Contamination in Modern Benchmarks for Large Language Models (link), Section 3.2.2 |

vLLM

The following methods require launching a vLLM instance, which handles model inference:

- guided-prompting
- min-prob
- cdd
- ts-guessing-question-based
- ts-guessing-question-multichoice

To launch the instance, first run the following command in a terminal:

sh llmsanitize/scripts/vllm_hosting.sh

You need to specify a port number and a model name in this shell script before running it.
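
Once the script is running, you can quickly check that the server is reachable before launching any detection method. This is only a sanity-check sketch: it assumes the script exposes vLLM's OpenAI-compatible HTTP API on localhost, and 6001 stands in for whatever port number you configured.

# Sanity check that the vLLM instance is up (6001 is a placeholder port).
import requests

resp = requests.get("http://localhost:6001/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())  # should list the model name configured in vllm_hosting.sh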

Run Contamination Detection

To run contamination detection, you can follow the test scripts in the llmsanitize/scripts/tests/ folder.

For instance, to run sharded-likelihood on Hellaswag with Llama-2-7B:

sh llmsanitize/scripts/tests/closed_data/sharded-likelihood/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder>

To run a method that uses vLLM, such as guided-prompting, the only difference is that you also pass the port number of your vLLM instance as an argument:

sh llmsanitize/scripts/tests/closed_data/guided-prompting/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder> -p <port_number_from_your_vllm_instance>

Or, since llmsanitize has been installed as a Python package, you can call the detection methods directly from your own Python script:

from llmsanitize import ClosedDataContaminationChecker

args = <set up your argparse arguments here>
contamination_checker = ClosedDataContaminationChecker(args)
contamination_checker.run_contamination("guided-prompting")  # make sure your args contain all parameters relevant to this specific method
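
For reference, here is a slightly fuller, self-contained sketch of the same call. The attributes set on args below (model path, dataset name, port) are hypothetical placeholders rather than the library's exact argument schema; check the test scripts under llmsanitize/scripts/tests/ for the parameters each method actually expects.

# Minimal sketch -- the Namespace fields below are hypothetical placeholders.
from argparse import Namespace
from llmsanitize import ClosedDataContaminationChecker

args = Namespace(
    model_name="<path_to_your_llama-2-7b_folder>",  # hypothetical field: model under test
    eval_data_name="hellaswag",                     # hypothetical field: benchmark to check
    port=6001,                                      # hypothetical field: port of your running vLLM instance
)

checker = ClosedDataContaminationChecker(args)
checker.run_contamination("guided-prompting")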

Citation

If you find our paper or this project helpful for your research, please consider citing our paper in your publication.

@article{ravaut2024much,
  title={How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library},
  author={Ravaut, Mathieu and Ding, Bosheng and Jiao, Fangkai and Chen, Hailin and Li, Xingxuan and Zhao, Ruochen and Qin, Chengwei and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2404.00699},
  year={2024}
}
