Extend LLMs to infinite length without sacrificing efficiency and performance, without retraining
Project description
Attention Sinks in Transformers for Infinite-length LLMs
Llama 2 7B | Falcon 7B |
---|---|
MPT 7B | Pythia 6.9B |
Mistral 7B | |
Overview
- Extend existing LLMs (e.g. Llama 2) to infinite length without sacrificing efficiency and performance, without any retraining.
- Model perplexities were stable even after 4 million tokens!
- Unlike with regular
transformers
, there is no linear memory increase and no extremely slow inference due to memory issues at higher sequence lengths.
- The
attention_sinks
API allows for a drop-in replacement of thetransformers
API:from attention_sinks import AutoModel model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
- Support for Llama, Falcon, MPT, GPTNeoX (Pythia) and Mistral models.
- New parameters to
AutoModel....from_pretrained
:attention_sink_size
, int, defaults to 4: The number of initial tokens to use as the attention sink. These tokens are always included in the Attention Sink KV Cache.attention_sink_window_size
, int, defaults to 1020: The size of the sliding window, i.e. the number of "recent tokens" to include in the Attention Sink KV Cache.
Installation
You can install attention_sinks
like so
pip install attention_sinks
Benchmarks
Pre-prepared benchmarks
See benchmark/scripts for a collection of ready-to-go scripts for various model architectures like Llama 2, Falcon, MPT and GPT-NeoX (Pythia). Each of these scripts runs the benchmarking and plotting tools described below for pure transformers
, attention_sinks
and a third alternative: windowed
, which involves simple windowed attention at a window size of 1024 tokens. Upon completion, the script will plot the figures that you see at the top of this README.
Benchmarking tool
You can run a few benchmarks to compute the perplexity of various models over time using the provided perplexity.py benchmarking script. This is done by computing the negative log likelihood losses of the chosen model when it is provided a full book with 60k+ tokens. By default, the scripts stop after 8192 tokens, but this can be modified. An ideal solution continuously has a low log perplexity and a constant CUDA VRAM usage.
To use the script, you can run:
python benchmark/perplexity.py --experiment attention_sinks
Full argument list
usage: perplexity.py [-h] [--experiment {attention_sinks,transformers,windowed}] [--model_name_or_path MODEL_NAME_OR_PATH] [--revision REVISION]
[--trust_remote_code] [--dataset_name DATASET_NAME] [--data_column DATA_COLUMN] [--task TASK] [--split {validation,test}]
[--num_tokens NUM_TOKENS] [--output_dir OUTPUT_DIR] [--window_size WINDOW_SIZE] [--attention_sink_size ATTENTION_SINK_SIZE]
options:
-h, --help show this help message and exit
--experiment {attention_sinks,transformers,windowed}
--model_name_or_path MODEL_NAME_OR_PATH
--revision REVISION
--trust_remote_code
--dataset_name DATASET_NAME
--data_column DATA_COLUMN
--task TASK
--split {validation,test}
--num_tokens NUM_TOKENS
--output_dir OUTPUT_DIR
--window_size WINDOW_SIZE
--attention_sink_size ATTENTION_SINK_SIZE
This script will create a csv
file in the output directory ("benchmarks/outputs"
by default) for that experiment, with information about perplexities, CUDA VRAM usage and latencies.
Plotting tool
The information from the benchmarking tool can be plotted using the plot_perplexity.py script. In particular, you can plot any combination of the following features:
perplexity
,vram
, i.e. CUDA VRAM usage,latency
.
For example:
python benchmark/plot_perplexity.py --features perplexity latency --title "Log perplexity & latency of Llama 2 7B as a function of input lengths"
Full argument list
usage: plot_perplexity.py [-h] [--output_dir OUTPUT_DIR] [--features {perplexity,vram,latency} [{perplexity,vram,latency} ...]] [--title TITLE]
[--log_perplexity_limit LOG_PERPLEXITY_LIMIT] [--skip_first SKIP_FIRST]
options:
-h, --help show this help message and exit
--output_dir OUTPUT_DIR
--features {perplexity,vram,latency} [{perplexity,vram,latency} ...]
--title TITLE
--log_perplexity_limit LOG_PERPLEXITY_LIMIT
--skip_first SKIP_FIRST
This script takes all csv
files from the output directory ("benchmark/outputs"
by default), and creates a plot like so:
python benchmark/plot_perplexity.py --features perplexity vram --title "Log perplexity & VRAM usage of Llama 2 7B as a function of input lengths" --output_dir benchmark/outputs_llama_2_7b --log_perplexity_limit 4
Clear as day:
transformers
: The VRAM usage is linear as it doesn't do any windowing. The performance heavily falls after ~4096 tokens.windowed
: The VRAM is constant usage due to the windowing at 1024 tokens. However, it fails as soon as the first tokens leave the window.attention_sinks
: Constant VRAM usage due to windowing with 4 attention sink tokens + the 1020 most recent tokens. This approach never fails despite the constant VRAM usage.
I've uploaded outputs of various benchmarks in benchmark so you can reproduce this graph using the former command.
Changelog
See CHANGELOG.md for all release information.
Credits
Inspired by, and adapted from StreamingLLM.
Citation
@article{xiao2023streamingllm,
title={Efficient Streaming Language Models with Attention Sinks},
author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
journal={arXiv},
year={2023}
}
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for attention_sinks-0.2.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 065c1748a0d3a31fb11aa88b2c56d828f06a6c7d2f949aba69dd8bdfff76bbeb |
|
MD5 | 2e3137884074f0975f41104f28b90493 |
|
BLAKE2b-256 | 211b6208ba90b00c8a89c3b92173555131335790d0e51e9c75ea89225e917c5c |