Vision Xformers
ViX
Vision Xformers: Efficient Attention for Image Classification
We replace the quadratic attention in ViT with linear attention mechanisms for image classification. We show that models using linear attention and CNN embedding layers need fewer parameters and fewer GPU resources to achieve good accuracy. These improvements can help democratize the use of transformers for practitioners who are limited by data and GPU resources.
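To illustrate why linear attention is cheaper than quadratic attention, here is a minimal NumPy sketch. It is not the Performer, Linformer, or Nyströmformer approximation used by the models in this project; it uses a generic kernel feature map (`elu(x) + 1`, an illustrative assumption) only to show the reordering of the matrix products. All function names are hypothetical.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard attention: forms an (n, n) score matrix, so cost grows as n^2."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """elu(x) + 1: a positive feature map (illustrative choice, not from the source)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Kernelized attention: computes phi(K)^T V first and never builds an (n, n) matrix."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v               # (d, d_v), independent of sequence length n
    z = q @ k.sum(axis=0)      # (n,), per-query normalizer
    return (q @ kv) / z[:, None]

def quadratic_reference(q, k, v):
    """Same kernelized attention, computed the expensive way for comparison."""
    q, k = feature_map(q), feature_map(k)
    a = q @ k.T                # (n, n)
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 16)) for _ in range(3))
out = linear_attention(q, k, v)
```

Both kernelized versions return the same result; only the order of multiplication differs, which is what removes the quadratic dependence on the number of tokens.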
Hybrid ViX uses convolutional layers instead of a linear layer for generating embeddings
Rotary Position Embedding (RoPE) is also used in our models instead of 1D learnable position embeddings
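As a rough illustration of rotary position embedding, the sketch below rotates consecutive feature pairs of each token by an angle proportional to the token's position. It is a 1D version over a flattened token sequence written in NumPy; whether the models here use exactly this variant (pairing scheme, `base` value, 1D versus axial) is an assumption, and `rotary_embed` is a hypothetical name.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply 1D rotary position embedding to x of shape (seq_len, dim), dim even."""
    n, d = x.shape
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    inv_freq = base ** (-np.arange(0, d, 2) / d)           # (d/2,) rotation frequencies
    angles = np.arange(n)[:, None] * inv_freq[None, :]     # (n, d/2) angle per position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                        # the two halves of each pair
    out = np.empty(x.shape, dtype=float)
    out[:, 0::2] = x1 * cos - x2 * sin                     # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, token norms are unchanged, and the dot product between a rotated query and a rotated key depends on their relative offset rather than on absolute positions. Unlike learnable position embeddings, this adds no parameters.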
Nomenclature: We replace the X in ViX with the first letter of the attention mechanism used. For example, when we use Performer in ViX, we replace the X with P and call it ViP (Vision Performer).
The 'Hybrid' prefix is used for models that use convolutional layers instead of a linear embedding layer.
We have added RoPE to the names of models that use Rotary Position Embedding.
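The difference between the linear embedding layer and the convolutional embedding that the 'Hybrid' prefix refers to can be sketched as follows. This is a simplified NumPy illustration, not the project's implementation: the single 3x3 layer, stride 1, zero padding, and the names `linear_patch_embed` and `conv_embed` are all assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def linear_patch_embed(img, weight, patch):
    """Split an (H, W, C) image into non-overlapping patches and project each one.

    weight has shape (patch * patch * C, dim); returns (num_patches, dim).
    """
    h, w, c = img.shape
    patches = img.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return patches @ weight

def conv_embed(img, kernel):
    """Stride-1 convolution with zero padding; every pixel becomes one token.

    kernel has shape (kh, kw, C, dim) with odd kh, kw; returns (H * W, dim).
    """
    h, w, c = img.shape
    kh, kw = kernel.shape[:2]
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    windows = sliding_window_view(padded, (kh, kw), axis=(0, 1))  # (H, W, C, kh, kw)
    feat = np.einsum("hwcij,ijcd->hwd", windows, kernel)          # overlapping receptive fields
    return feat.reshape(h * w, -1)

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8, 3))
weight = rng.normal(size=(4 * 4 * 3, 16))
kernel = rng.normal(size=(3, 3, 3, 16))
patch_tokens = linear_patch_embed(img, weight, patch=4)  # 4 tokens of width 16
conv_tokens = conv_embed(img, kernel)                    # 64 tokens of width 16
```

In this sketch the linear projection uses 48 x 16 weights tied to a fixed patch grid, while the 3x3 kernel uses 27 x 16 weights shared across all positions and lets neighboring tokens overlap.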
The code for using all of these models for classification on the CIFAR-10 and Tiny ImageNet datasets is provided.
Models
- Vision Linformer (ViL)
- Vision Performer (ViP)
- Vision Nyströmformer (ViN)
- FNet
- Hybrid Vision Transformer (HybridViT)
- Hybrid Vision Linformer (HybridViL)
- Hybrid Vision Performer (HybridViP)
- Hybrid Vision Nyströmformer (HybridViN)
- Hybrid FNet
- LeViN (Replacing Transformer in LeViT with Nyströmformer)
- LeViP (Replacing Transformer in LeViT with Performer)
- CvN (Replacing Transformer in CvT with Nyströmformer)
- CvP (Replacing Transformer in CvT with Performer)
- CCN (Replacing Transformer in CCT with Nyströmformer)
- CCP (Replacing Transformer in CCT with Performer)
We have adapted the code for ViT and linear transformers from @lucidrains.
More information about these models can be found in our papers: ArXiv Paper, WACV 2022 Paper
If you wish to cite this, please use:
@misc{jeevan2021vision,
title={Vision Xformers: Efficient Attention for Image Classification},
author={Pranav Jeevan and Amit Sethi},
year={2021},
eprint={2107.02239},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@InProceedings{Jeevan_2022_WACV,
author = {Jeevan, Pranav and Sethi, Amit},
title = {Resource-Efficient Hybrid X-Formers for Vision},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {January},
year = {2022},
pages = {2982-2990}
}