
Calculate the LongPPL of long-context LLMs

Project description

LongPPL

This repository is the official implementation of the ICLR 2025 paper What is Wrong with Perplexity for Long-context Language Modeling?

Introduction

Handling long-context inputs is crucial for large language models (LLMs). While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. We find that by averaging over all tokens, PPL overlooks the key tokens that are essential for long-context understanding, thereby obscuring models' true performance in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens, identified through a long-short context contrastive method. Additionally, we introduce the LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens.
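To make the idea concrete, below is a minimal conceptual sketch (in PyTorch) of the long-short context contrastive selection and the two resulting quantities; the threshold, the key-token weight, and the function names are illustrative assumptions, not the exact formulation used in the paper or in this package. Here, logp_long and logp_short are per-token log-probabilities from the evaluator model with and without the long context, and target_logp comes from the model being evaluated (for LongPPL) or fine-tuned (for LongCE).

import torch

def key_token_mask(logp_long, logp_short, threshold=2.0):
    """Mark tokens whose log-likelihood improves most when the long context is visible."""
    gain = logp_long - logp_short          # long-short contrastive score per token
    return gain > threshold                # boolean mask of key tokens (threshold is assumed)

def long_ppl(target_logp, mask):
    """Perplexity of the evaluated model, restricted to key tokens."""
    return torch.exp(-target_logp[mask].mean())

def long_ce(target_logp, mask, key_weight=2.0):
    """Cross-entropy that up-weights key tokens (plug-and-play re-weighting; weight is assumed)."""
    weights = 1.0 + (key_weight - 1.0) * mask.float()
    return -(weights * target_logp).sum() / weights.sum()

# Toy example with dummy per-token log-probabilities:
logp_long = torch.tensor([-0.5, -4.0, -0.2, -3.0])    # evaluator, full long context
logp_short = torch.tensor([-0.6, -7.5, -0.3, -6.0])   # evaluator, truncated short context
target_logp = torch.tensor([-0.4, -3.5, -0.3, -2.5])  # model being evaluated
mask = key_token_mask(logp_long, logp_short)           # tokens 1 and 3 are key tokens
print(long_ppl(target_logp, mask).item(), long_ce(target_logp, mask).item())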


Our experiments demonstrate that LongPPL correlates strongly with performance on various long-context benchmarks (e.g., a Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. In addition, experimental results show that LongCE yields consistent improvements as a plug-and-play solution.

Requirements

Python 3.10 + PyTorch 2.3 + Transformers 4.45

pip install -r requirements.txt

LongPPL

The code supports computing LongPPL on custom LLMs and datasets. Please run:

pip install longppl

or

git clone https://github.com/PKU-ML/LongPPL.git
cd LongPPL
pip install -e .

and use the following code to calculate LongPPL:

from longppl import compute_longppl

output = compute_longppl(text, model, evaluator_model, tokenizer, evaluator_tokenizer)
print(output['longppl'])
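For context, here is a hedged end-to-end sketch of preparing the arguments; the use of Hugging Face transformers loaders, the specific model names, and the input file path are illustrative assumptions rather than requirements of the package.

from transformers import AutoModelForCausalLM, AutoTokenizer
from longppl import compute_longppl

# Model under evaluation and evaluator model used to identify key tokens
# (both repository names are assumed here for illustration only).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
evaluator_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-72B-Instruct", device_map="auto")
evaluator_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

# Any sufficiently long text; the file name is a placeholder.
with open("long_document.txt") as f:
    text = f.read()

output = compute_longppl(text, model, evaluator_model, tokenizer, evaluator_tokenizer)
print(output['longppl'])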

Reproduce the paper

LongPPL

To reproduce the LongPPL experiments in our paper, please run:

cd perplexity
sh run_ppl.sh

The evaluation data can be downloaded from GovReport (tokenized). Here are our main results.

| Models | LongPPL (Qwen-72B-Instruct) | LongPPL (Mistral Large 2) | LongPPL (Llama-3.1-8B) | PPL |
| --- | --- | --- | --- | --- |
| Mixtral-8x7B | 1.99 | 2.33 | 1.70 | 3.59 |
| FILM-7B | 2.28 | 2.81 | 1.95 | 4.35 |
| Mistral-7B | 2.48 | 3.10 | 2.11 | 4.14 |
| Qwen1.5-14B | 2.67 | 2.57 | 2.19 | 5.07 |
| Qwen2-7B | 2.66 | 2.48 | 2.16 | 4.82 |
| Phi-3-small | 2.66 | 2.58 | 2.28 | 5.29 |
| CLEX-7B | 3.28 | 3.95 | 2.74 | 4.04 |
| Yi-6B | 3.19 | 3.38 | 2.65 | 4.96 |
| Yarn-7B | 3.47 | 4.51 | 2.98 | 4.06 |
  • While standard perplexity shows almost no correlation with the models' long-context performance as measured by the benchmarks (please refer to our paper), LongPPL demonstrates a strong correlation.

LongCE

To perform long-context fine-tuning with LongCE, run accelerate config and enable DeepSpeed acceleration; deepspeed/zero3.json is the configuration file we used for training.

cd finetune
sh train.sh

The training data can be downloaded from PG19 and Pile-arxiv. To run models with EABF, please downgrade transformers to version 4.37.0.
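If needed, the downgrade can be done with pip:

pip install transformers==4.37.0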

Evaluation on Long-context Benchmarks

In the paper, we evaluate models on LongBench, LongEval, and RULER. Please refer to their respective code repositories.

Citation

If you use our code, please cite:

@article{fang2024wrong,
      title={What is Wrong with Perplexity for Long-context Language Modeling?}, 
      author={Lizhe Fang and Yifei Wang and Zhaoyang Liu and Chenheng Zhang and Stefanie Jegelka and Jinyang Gao and Bolin Ding and Yisen Wang},
      year={2024},
      journal={arXiv preprint arXiv:2410.23771}
}

Download files

Download the file for your platform.

Source Distribution

longppl-0.3.0.tar.gz (9.6 kB)

Uploaded Source

Built Distribution


longppl-0.3.0-py3-none-any.whl (9.6 kB)

Uploaded Python 3

File details

Details for the file longppl-0.3.0.tar.gz.

File metadata

  • Download URL: longppl-0.3.0.tar.gz
  • Upload date:
  • Size: 9.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for longppl-0.3.0.tar.gz
Algorithm Hash digest
SHA256 186cb4d6ca888b971702c9d6d2762b78355d949b5d9c8eebf28b774bcd7476b6
MD5 1ff1bc2ba523a8ebc3cdcdf920f49417
BLAKE2b-256 93a391a50a461babd20c617d147da97091a20b6ea9e0a5912f49cd696fc90fc5


File details

Details for the file longppl-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: longppl-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 9.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for longppl-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4764842acb4508378b2dd03d49c67e9d6ef5ca5c34308ca824e0da33a9a69db5
MD5 9d1351181b5078e19ac2cb844432cec3
BLAKE2b-256 2d5d7ad89fc82fdb315d00a8113b99e077b47a3fbc1d6e538cc7d54b3ee6d8e9

