A short description of your package

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Chameleon-BOD: Detecting Overfit in LLM Evaluations

Welcome to Chameleon-BOD â€“ a repository for our paper "Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon." This project provides a meta-evaluation framework for large language models (LLMs) that reveals whether a modelâ€™s performance is based on memorized prompt patterns rather than genuine language understanding.

Abstract

Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a modelâ€™s performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings indicating that both cases may overrely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BODâ€™s dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.

Repository Structure

Chameleon-BOD/
â”œâ”€â”€ README.md          <-- (This file)
â”œâ”€â”€ code/
â”‚   â”œâ”€â”€ MMLU_Eval.py
â”‚   â””â”€â”€ mmlu_rephrase_DS.py
â”œâ”€â”€ Data/
â”‚   â””â”€â”€ rephrased_mmlu_test_parallel_temp1_0.json
â”œâ”€â”€ Results/
â”‚   â””â”€â”€ results_{model_name}.json
â””â”€â”€ paper/
    â””â”€â”€ paper.pdf

code/: Contains our experimental Python scripts:
- MMLU_Eval.py â€“ Evaluate LLM predictions on original and rephrased prompts.
- mmlu_rephrase_DS.py â€“ Uses the DeepSeek API to generate a perturbed (rephrased) dataset.
paper/: Contains the paper.
Data/: Contains the rephrased dataset:
- rephrased_mmlu_test_parallel_temp1_0.json â€“ the rephrased dataset (Î¼=1.0).
Results/: Contains the results of the evaluated LLMs:
- results_{model_name}.json â€“ The results of LLM {model_name}.

Requirements

To run the code, you need the following packages:

torch (>= 1.10.0)
transformers (>= 4.25.0)
tqdm (>= 4.60.0)
requests (>= 2.25.0)
datasets (>= 2.0.0)

Getting Started

1. Clone the Repository

git clone https://github.com/yourusername/Chameleon-BOD.git
cd Chameleon-BOD

2. Set Up a Virtual Environment (Optional)

Itâ€™s a good idea to create a virtual environment:

python -m venv venv
source venv/bin/activate   # On Windows use: venv\Scripts\activate

3. Install the Dependencies

pip install -r requirements.txt

4. Running the Experiments

a. Evaluate LLM Predictions

To run the evaluation script with a Hugging Face model (for example, using microsoft/phi-4):

python code/MMLU_Eval.py --model microsoft/phi-4 --batch_size 1

This command loads the specified model from Hugging Face and evaluates it on the original and rephrased MMLU questions.

b. Generate the Rephrased Dataset

Before evaluation, generate a rephrased dataset using the DeepSeek API:

Update the API Key:
Open the file code/mmlu_rephrase_DS.py and replace the placeholder "XXXXXXXXXXXXXX" with your actual DeepSeek API key.
Run the Script:

python code/mmlu_rephrase_DS.py

This script processes the MMLU test set and saves a rephrased version to rephrased_mmlu_test_parallel_temp1_0.json.

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

0.4.0

Feb 18, 2025

0.3.0

Feb 18, 2025

0.2.0

Feb 18, 2025

0.1.0

Feb 18, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cbod-0.4.0.tar.gz (6.5 kB view details)

Uploaded Feb 18, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

CBOD-0.4.0-py3-none-any.whl (6.6 kB view details)

Uploaded Feb 18, 2025 Python 3

File details

Details for the file cbod-0.4.0.tar.gz.

File metadata

Download URL: cbod-0.4.0.tar.gz
Upload date: Feb 18, 2025
Size: 6.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for cbod-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`87ab7fe4c9afb1db442027d7aeeacb4c4352c8bc52e33e8a1ecb2d942eea59c8`
MD5	`4a7cd9ef4243af680aa93feee76f8707`
BLAKE2b-256	`66a15460cec0bda0b912e93989c57e7de3f41f6591e14bd70ec46d3b3540f9eb`

See more details on using hashes here.

File details

Details for the file CBOD-0.4.0-py3-none-any.whl.

File metadata

Download URL: CBOD-0.4.0-py3-none-any.whl
Upload date: Feb 18, 2025
Size: 6.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for CBOD-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dced0a703426817d81358c507eec9a977663dfa9fe63d0ba8ceb52ae236aa397`
MD5	`35257b607a7adbf3d14b4e6b5927bc63`
BLAKE2b-256	`9543377ba9fa1837e9e57e454356efd2a6beecccca710802f10b928022d7b25e`

See more details on using hashes here.

CBOD 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Chameleon-BOD: Detecting Overfit in LLM Evaluations

Abstract

Repository Structure

Requirements

Getting Started

1. Clone the Repository

2. Set Up a Virtual Environment (Optional)

3. Install the Dependencies

4. Running the Experiments

a. Evaluate LLM Predictions

b. Generate the Rephrased Dataset

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes