mega-vit - PyTorch
MegaVit
A simple implementation of a CLIP-style model that splits an image into quadrants and computes an embedding for each quadrant.
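To make the quadrant idea concrete, here is a minimal, hypothetical sketch (not the MegaVit API) that slices a batch of images into four quadrants with plain tensor indexing and mean-pools each quadrant into a per-quadrant embedding; `quadrant_embeddings` is an illustrative name, not a function from this package:

```python
import torch

def quadrant_embeddings(img: torch.Tensor) -> torch.Tensor:
    """Split (B, C, H, W) images into 4 quadrants, mean-pool each to (B, 4, C).

    Hypothetical illustration of the quadrant-splitting idea; the real model
    would run each quadrant through an encoder instead of mean-pooling.
    """
    b, c, h, w = img.shape
    hh, hw = h // 2, w // 2
    quads = [
        img[:, :, :hh, :hw],  # top-left
        img[:, :, :hh, hw:],  # top-right
        img[:, :, hh:, :hw],  # bottom-left
        img[:, :, hh:, hw:],  # bottom-right
    ]
    # Mean over spatial dims gives one C-dim vector per quadrant.
    return torch.stack([q.mean(dim=(2, 3)) for q in quads], dim=1)

emb = quadrant_embeddings(torch.randn(2, 3, 256, 256))
print(emb.shape)  # torch.Size([2, 4, 3])
```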
Appreciation
- Lucidrains
- Agorians
Install
pip install mega-vit
Usage
- Simple usage:

import torch
from mega_vit.main import MegaVit

v = MegaVit(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 256, 256)
preds = v(img) # (1, 1000)
print(preds)
- Hyperparameters as stated in the paper (note: the model class is MegaVit, and ViT-22B uses an MLP hidden dimension of 4x the model width, i.e. 24576):

import torch
from mega_vit.main import MegaVit

v = MegaVit(
    image_size = 224,
    patch_size = 14,
    num_classes = 1000,
    dim = 6144,
    depth = 48,
    heads = 48,
    mlp_dim = 24576,
    dropout = 0.1,
    emb_dropout = 0.1
)

img = torch.randn(1, 3, 224, 224)
preds = v(img) # (1, 1000)
print(preds)
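A quick back-of-the-envelope check that these hyperparameters land near 22B parameters. This is a rough estimate, assuming mlp_dim = 4 * dim (24576) as in the paper and counting only the dominant weight matrices per transformer block:

```python
# Rough parameter-count sanity check for the paper configuration.
# Per transformer block (dominant terms only, biases/embeddings ignored):
#   attention ≈ 4 * dim^2          (Q, K, V, and output projections)
#   MLP       ≈ 2 * dim * mlp_dim  (= 8 * dim^2 when mlp_dim = 4 * dim)
dim, depth = 6144, 48
per_block = 4 * dim**2 + 8 * dim**2
total = per_block * depth
print(f"{total / 1e9:.1f}B parameters")  # 21.7B, close to the 22B in the paper's name
```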
Architecture
Dataset Strategy
The paper trains ViT-22B on a version of the JFT dataset that has been extended to around 4 billion images. JFT is a large-scale dataset scraped from the internet, originally containing over 300 million images labeled with a hierarchical taxonomy of 30,000 categories.
The authors do not provide full details on how the dataset was extended from the original JFT to 4 billion images. However, the goal seems to be creating a larger and more diverse training set to support scaling up the model size. Pre-training on larger datasets enables learning more robust and generalizable visual representations.
The authors evaluate ViT-22B on a comprehensive set of 39 datasets covering various domains like image classification, dense prediction tasks, video, and fairness benchmarks. Using such a diverse evaluation suite allows them to thoroughly assess the scalability and transferability of ViT-22B across different domains and data distributions.
Below is a table summarizing some of the key datasets used in the paper:
Dataset | Domain | Images | Classes
---|---|---|---
JFT (training set) | Internet images | ~4 billion | 30,000
ImageNet | Natural images | 1.28M | 1,000
ImageNet-C | Corrupted ImageNet images | 1.28M | 1,000
ImageNet-R | Renditions (art, sketches) of ImageNet classes | 30K | 200
ImageNet-A | Natural adversarial examples | 7.5K | 200
ObjectNet | Natural images with controlled viewpoints | 113K | 313
CIFAR-10 | Tiny natural images | 60K | 10
CIFAR-100 | Tiny natural images | 60K | 100
ADE20K | Scene parsing | 25K | 150
Kinetics-400 | Human action videos | ~400K clips | 400
CelebA | Celebrity faces | 202K | 40 attributes
License
MIT
Citations