An implementation of the Linformer in Pytorch
Linformer Pytorch Implementation
A practical implementation of the Linformer paper.
It has not been empirically tested (i.e., whether it performs well on any datasets), but the self attention mechanism works.
I am not the author of the paper.
Install
pip install linformer-pytorch
Alternatively,
git clone https://github.com/tatp22/linformer-pytorch.git
cd linformer-pytorch
Code example
from linformer_pytorch import Linformer
import torch
device = torch.device("cuda")
model = Linformer(
input_size=262144, # Dimension 1 of the input
channels=64, # Dimension 2 of the input
dim_d=256, # The inner dimension of the attention heads
dim_k=128, # The second dimension of the P_bar matrix from the paper
dim_ff=128, # Dimension in the feed forward network
dropout_ff=0.15, # Dropout for feed forward network
nhead=4, # Number of attention heads
depth=2, # How many times to run the model
dropout=0.1, # How much dropout to apply to P_bar after softmax
activation="gelu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
).to(device)
x = torch.randn(1, 262144, 64).to(device)
y = model(x)
print(y)
Checkpoint levels
As an attempt to further introduce memory savings, the concept of checkpoint levels has been introduced. The three current checkpoint levels are C0, C1, and C2. When going up checkpoint levels, one sacrifices speed for memory savings. That is, checkpoint level C0 is the fastest, but takes up the most space on the GPU, while C2 is the slowest, but takes up the least space on the GPU. The details of each checkpoint level are as follows:
- C0: No checkpointing. The model runs while keeping all of the attention heads and ff layers in GPU memory.
- C1: Checkpoint each MultiHead attention as well as each ff layer. With this, increasing depth should have minimal impact on memory.
- C2: Along with the optimizations at the C1 level, checkpoint each head in each MultiHead Attention layer. With this, increasing nhead should have less of an impact on memory. However, concatenating the heads together with torch.cat still takes up a lot of memory, and this will hopefully be optimized out in the future.
Performance details are still unknown, but the option exists for users that want to try.
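For intuition, here is a minimal sketch of the kind of activation checkpointing that the higher checkpoint levels rely on, using torch.utils.checkpoint directly. This is an illustration only, not the exact code used inside this repo:
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(64, 64)              # stand-in for an attention head or ff layer
x = torch.randn(8, 64, requires_grad=True)

y_plain = layer(x)                           # C0-style: activations stay in GPU memory
y_ckpt = checkpoint(layer, x)                # C1/C2-style: activations recomputed during backward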
Padder
One slight problem with the current implementation of the Linformer is that your sequence length has to match the input_size flag of the model. The Padder pads the input so that the tensor can be fed into the network. An example:
from linformer_pytorch import Linformer, Padder
import torch
model = Linformer(
input_size=512,
channels=16,
dim_d=32,
dim_k=16,
dim_ff=32,
nhead=6,
depth=3,
checkpoint_level="C1",
)
model = Padder(model)
x = torch.randn(1, 500, 16) # This does not match the input size!
y = model(x)
print(y) # (1, 500, 16)
Practical Tips
- Note that the Linformer has O(nk) time and space complexity. So, while it may be linear in n, make sure that your k is not too large as well. These are editable with input_size and dim_k, respectively (see the memory sketch after this list).
- Speaking of k, the authors found empirical evidence that "the performance of Linformer model is mainly determined by the projected dimension k instead of the ratio n/k". Therefore, even when increasing sequence lengths, it may be fine to keep a relatively low, constant k (the authors showed that, with k=256, it still performed almost as well as a vanilla transformer).
- One more tip for k: the authors recommend k = O(d/eps^2) if self attention is to be approximated by full attention with eps error.
- This code, so far, is pretty much only linear layers and matrix multiplications, so libraries like apex should work with it; however, in practice, it has not been tested.
- In practice, I found that the memory and time requirements are more on the order of O(nkd), with n=input_size, k=dim_k, and d=dim_d.
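To get a feel for what O(nk) versus O(n^2) means in practice, here is a quick back-of-the-envelope check using the values from the example above (input_size=262144, dim_k=128). It only illustrates the scaling, it is not a measurement of this implementation:
n = 262144   # input_size: sequence length
k = 128      # dim_k: projected dimension

full_attention = n * n        # entries in a standard n x n attention matrix
linformer_attention = n * k   # entries in the projected n x k P_bar matrix

print(f"full attention:      {full_attention:,} entries")       # 68,719,476,736
print(f"linformer attention: {linformer_attention:,} entries")  # 33,554,432
print(f"reduction factor:    {full_attention / linformer_attention:.0f}x")  # 2048x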
Future work
- Change the einsums to matmul for faster multiplication
- Fix a bug where the model is using too much memory. Probably has to do with the inner dimension.
- Add positional embeddings
- Add an option to change the E and F downsampling matrices
- Run some benchmark tests to see what the performance is
- Instead of matrix multiplication to bring the dimensions down to k (with EKW and FVW), try to do convolution, as mentioned in the paper, with a stride length and kernel size of n/k (a rough sketch of this follows the list).
- In the paper, empirical studies showed that one can reduce the value of k when increasing depth. Add some option to decrease k further per layer, saving even more memory.
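As a rough sketch of the convolutional downsampling idea mentioned above (not something this repo implements yet), a 1D convolution with kernel size and stride of n/k maps a (batch, n, d) input down to (batch, k, d), playing the same role as the E and F projection matrices (assuming n is divisible by k):
import torch
import torch.nn as nn

n, k, d = 512, 16, 32            # sequence length, projected dimension, channel dimension
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=n // k, stride=n // k)

x = torch.randn(1, n, d)         # (batch, seq_len, dim)
x = x.transpose(1, 2)            # Conv1d expects (batch, dim, seq_len)
out = conv(x).transpose(1, 2)    # back to (batch, k, dim)
print(out.shape)                 # torch.Size([1, 16, 32])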
Disclaimer
This is the first time that I am reproducing a result from a paper, so some things may be wrong. If you see a problem, please open up an issue, and I will attempt to work on it.
Thanks
Thank you to lucidrains, whose other sparse attention repositories helped me in designing this Linformer Repo.
Citations
@misc{wang2020linformer,
title={Linformer: Self-Attention with Linear Complexity},
author={Sinong Wang and Belinda Z. Li and Madian Khabsa and Han Fang and Hao Ma},
year={2020},
eprint={2006.04768},
archivePrefix={arXiv},
primaryClass={cs.LG}
}