Spatial Shift ViT

S²-ViT is a hierarchical vision transformer with shifted window attention. In contrast to Swin, the shift operation is taken from S²-MLP: the channels are split into groups that are shifted in all four directions simultaneously, with no roll or unroll operation. S²-ViT also adopts the patch embedding and positional encoding of Twins-SVT, and the StarReLU activation from MetaFormer.
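To make this concrete, below is a minimal sketch of a four-direction spatial shift in the style of S²-MLP, together with the StarReLU activation from MetaFormer (scale · relu(x)² + bias with learnable scalars). This is illustrative only, not necessarily how s2vit implements either; spatial_shift and StarReLU here are hypothetical names, and the sketch assumes the channel count is divisible by four.

import torch
from torch import nn

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    # x: (B, H, W, C). Split the channels into four groups and shift
    # each group one pixel in a different direction. Border positions
    # keep their original values, so no roll or unroll is needed.
    g = x.shape[-1] // 4  # assumes C is divisible by 4
    out = x.clone()
    out[:, 1:, :, :g] = x[:, :-1, :, :g]            # shift down
    out[:, :-1, :, g:2*g] = x[:, 1:, :, g:2*g]      # shift up
    out[:, :, 1:, 2*g:3*g] = x[:, :, :-1, 2*g:3*g]  # shift right
    out[:, :, :-1, 3*g:] = x[:, :, 1:, 3*g:]        # shift left
    return out

class StarReLU(nn.Module):
    # StarReLU from MetaFormer: scale * relu(x)**2 + bias, where both
    # scalars are learnable parameters.
    def __init__(self, scale: float = 1.0, bias: float = 0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))
        self.bias = nn.Parameter(torch.tensor(bias))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x) ** 2 + self.bias

Because each channel group only moves one pixel and border values stay in place, the shift mixes information between neighbouring windows without the cyclic roll (and attention masking) that Swin's shifted windows require.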

Prerequisites

  • Python 3.10+
  • PyTorch 2.0+

Installation

pip install s2vit

Usage

import torch
from s2vit import S2ViT

vit = S2ViT(
    depths=(2, 2, 6, 2),       # transformer blocks per stage
    dims=(64, 128, 160, 320),  # channel width of each stage
    global_pool=True,
    num_classes=1000,
)

img = torch.randn(1, 3, 256, 256)
vit(img)  # (1, 1000)
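For inference, the standard PyTorch pattern applies; nothing below is specific to s2vit:

vit.eval()
with torch.inference_mode():
    logits = vit(img)               # (1, 1000) class logits
    probs = logits.softmax(dim=-1)  # class probabilities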

Acknowledgements

Thanks to lucidrains for his excellent work, including vit-pytorch, x-transformers, and his discovery of shared key / value attention.

Citations

@inproceedings{Yu2021S2MLPSM,
  title={S2-MLP: Spatial-Shift MLP Architecture for Vision},
  author={Tan Yu and Xu Li and Yunfeng Cai and Mingming Sun and Ping Li},
  booktitle={2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2022},
  pages={3615-3624},
  url={https://api.semanticscholar.org/CorpusID:235422259}
}
@inproceedings{Liu2021SwinTH,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Ze Liu and Yutong Lin and Yue Cao and Han Hu and Yixuan Wei and Zheng Zhang and Stephen Lin and Baining Guo},
  booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={9992-10002},
  url={https://api.semanticscholar.org/CorpusID:232352874}
}
@inproceedings{Liu2021SwinTV,
  title={Swin Transformer V2: Scaling Up Capacity and Resolution},
  author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
  booktitle={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
  pages={11999-12009},
  url={https://api.semanticscholar.org/CorpusID:244346076}
}
@inproceedings{Chu2021TwinsRT,
  title={Twins: Revisiting the Design of Spatial Attention in Vision Transformers},
  author={Xiangxiang Chu and Zhi Tian and Yuqing Wang and Bo Zhang and Haibing Ren and Xiaolin Wei and Huaxia Xia and Chunhua Shen},
  booktitle={Neural Information Processing Systems},
  year={2021},
  url={https://api.semanticscholar.org/CorpusID:234364557}
}
@article{Yu2022MetaFormerBF,
  title={MetaFormer Baselines for Vision},
  author={Weihao Yu and Chenyang Si and Pan Zhou and Mi Luo and Yichen Zhou and Jiashi Feng and Shuicheng Yan and Xinchao Wang},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.13452},
  url={https://api.semanticscholar.org/CorpusID:253098429}
}
@article{Touvron2022ThreeTE,
  title={Three things everyone should know about Vision Transformers},
  author={Hugo Touvron and Matthieu Cord and Alaaeldin El-Nouby and Jakob Verbeek and Herv{\'e} J{\'e}gou},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.09795},
  url={https://api.semanticscholar.org/CorpusID:247594673}
}
@article{Chowdhery2022PaLMSL,
  title={PaLM: Scaling Language Modeling with Pathways},
  author={Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and Abhishek Rao and Parker Barnes and Yi Tay and Noam M. Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Benton C. Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garc{\'i}a and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark D{\'i}az and Orhan Firat and Michele Catasta and Jason Wei and Kathleen S. Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.02311},
  url={https://api.semanticscholar.org/CorpusID:247951931}
}
@article{Bondarenko2023QuantizableTR,
  title={Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing},
  author={Yelysei Bondarenko and Markus Nagel and Tijmen Blankevoort},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.12929},
  url={https://api.semanticscholar.org/CorpusID:259224568}
}
