LongPPL
This repository is the official implementation of the ICLR 2025 paper What is Wrong with Perplexity for Long-context Language Modeling?
Introduction
Handling long-context inputs is crucial for large language models (LLMs). While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Additionally, we introduce LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens.
Our experiments demonstrate that LongPPL correlates strongly with performance on various long-context benchmarks (e.g., a Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. In addition, LongCE delivers consistent improvements as a plug-and-play fine-tuning solution.
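For intuition, a token counts as a key token when it is predicted much better with the full long context than with only a truncated short context. The toy sketch below illustrates this long-short contrast; the short-context length and threshold values are illustrative assumptions, and a single model is reused for brevity, whereas the packaged compute_longppl uses a separate evaluator model for this step.

```python
# Toy sketch of the long-short context contrast used to identify key tokens.
# Assumptions: short_len and threshold are illustrative values; the packaged
# compute_longppl performs this step with a separate evaluator model.
import torch

def key_token_mask(text, model, tokenizer, short_len=512, threshold=2.0):
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

    def token_logprobs(input_ids):
        # Log-probability assigned to each token given its preceding context.
        with torch.no_grad():
            logits = model(input_ids).logits
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        return logprobs.gather(-1, input_ids[:, 1:, None]).squeeze(-1)

    # Log-probs of the last `short_len` tokens given the full long context ...
    long_lp = token_logprobs(ids)[0, -short_len:]
    # ... versus given only a truncated short context.
    short_lp = token_logprobs(ids[:, -(short_len + 1):])[0]

    # Tokens whose likelihood improves sharply with long context are key tokens;
    # LongPPL averages the log-loss over these tokens only.
    return (long_lp - short_lp) > threshold
```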
Requirements
Python 3.10 + PyTorch 2.3 + Transformers 4.45
```bash
pip install -r requirements.txt
```
LongPPL
The code supports calculating LongPPL on custom LLMs and datasets. Please run:
```bash
pip install longppl
```
or
```bash
git clone https://github.com/PKU-ML/LongPPL.git
cd LongPPL
pip install -e .
```
and use the following code to calculate LongPPL:
```python
from longppl import compute_longppl

output = compute_longppl(text, model, evaluator_model, tokenizer, evaluator_tokenizer)
print(output['longppl'])
```
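A complete call needs the evaluated model, an evaluator model used to identify key tokens, and their tokenizers. Below is a minimal end-to-end sketch; the checkpoint names are examples (any Hugging Face causal LM and any sufficiently strong evaluator can be substituted), and the input file path is a placeholder.

```python
# End-to-end sketch; model names and the input path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from longppl import compute_longppl

model_name = "mistralai/Mistral-7B-v0.1"        # model being evaluated (example)
evaluator_name = "Qwen/Qwen2-72B-Instruct"      # key-token evaluator (example)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
evaluator_model = AutoModelForCausalLM.from_pretrained(evaluator_name, torch_dtype="auto", device_map="auto")
evaluator_tokenizer = AutoTokenizer.from_pretrained(evaluator_name)

text = open("long_document.txt").read()         # any long-context text sample

output = compute_longppl(text, model, evaluator_model, tokenizer, evaluator_tokenizer)
print(output["longppl"])
```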
Reproduce the paper
LongPPL
To reproduce the LongPPL experiments in our paper, please run:
```bash
cd perplexity
sh run_ppl.sh
```
The evaluation data can be downloaded from GovReport (tokenized). Here are our main results.
| Models | LongPPL(Qwen-72B-Instruct) | LongPPL(Mistral Large 2) | LongPPL(Llama-3.1-8B) | PPL |
|---|---|---|---|---|
| Mixtral-8x7B | 1.99 | 2.33 | 1.70 | 3.59 |
| FILM-7B | 2.28 | 2.81 | 1.95 | 4.35 |
| Mistral-7B | 2.48 | 3.10 | 2.11 | 4.14 |
| Qwen1.5-14B | 2.67 | 2.57 | 2.19 | 5.07 |
| Qwen2-7B | 2.66 | 2.48 | 2.16 | 4.82 |
| Phi-3-small | 2.66 | 2.58 | 2.28 | 5.29 |
| CLEX-7B | 3.28 | 3.95 | 2.74 | 4.04 |
| Yi-6B | 3.19 | 3.38 | 2.65 | 4.96 |
| Yarn-7B | 3.47 | 4.51 | 2.98 | 4.06 |
- While perplexity shows almost no correlation with the models' long-context performance on benchmarks (please refer to our paper), LongPPL demonstrates a strong correlation.
LongCE
To conduct long-context fine-tuning with LongCE, first run `accelerate config` and enable DeepSpeed acceleration; `deepspeed/zero3.json` is the configuration file we used for training. Then run:
```bash
cd finetune
sh train.sh
```
The training data can be downloaded from PG19 and Pile-arxiv.
To run models with eabf, please downgrade the version of transformers to 4.37.0.
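For intuition, the idea behind LongCE is a per-token re-weighted cross-entropy in which tokens that benefit more from the long context receive larger weights. The sketch below is a conceptual illustration only: the clipped-exponential weighting and the cap value are assumptions made here for clarity, not the exact formula implemented in the training code under finetune/.

```python
# Conceptual sketch of a re-weighted cross-entropy in the spirit of LongCE.
# The weighting function (clipped exponential of the long-short gain) and the
# cap are assumptions for illustration; see the paper and finetune/ for the
# exact definition used in training.
import torch
import torch.nn.functional as F

def reweighted_ce(logits, labels, long_lp, short_lp, cap=5.0):
    """logits: [B, T, V]; labels: [B, T]; long_lp / short_lp: [B, T] per-token
    log-probs of the labels under long and short context, respectively."""
    # Per-token cross-entropy, no reduction.
    ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")  # [B, T]

    # Up-weight tokens whose prediction gains most from the long context.
    weights = torch.clamp(torch.exp(long_lp - short_lp), max=cap)

    return (weights * ce).sum() / weights.sum()
```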
Evaluation on Long-context Benchmarks
In the paper, we evaluate models on LongBench, LongEval, and RULER. Please refer to the respective code repositories for evaluation instructions.
Citation
If you use our code, please cite:

```bibtex
@article{fang2024wrong,
  title={What is Wrong with Perplexity for Long-context Language Modeling?},
  author={Lizhe Fang and Yifei Wang and Zhaoyang Liu and Chenheng Zhang and Stefanie Jegelka and Jinyang Gao and Bolin Ding and Yisen Wang},
  journal={arXiv preprint arXiv:2410.23771},
  year={2024}
}
```