Training of CLIP in JAX

Project description

CLIP-JAX

This repository is used to train vision models with JAX:

many types of model architectures
any sharding strategy
training with constrastive loss such as CLIP, chunked sigmoid loss or captioning loss such as CapPa
downstream fine-tuning

Refer to the report "CapPa: Training vision models as captioners" for the open-source reproduction of CapPa.

Installation

pip install clip-jax

Note: this package is currently under active development, install from source for latest version.

Usage

Use a trained model

Refer to utils/demo_cappa.ipynb.

Download training data

You can download training data from DataComp:

# clone and install datacomp

# download data
python download_upstream.py \
    --scale small --data_dir gs://my_bucket/datacomp/small metadata_dir metadata \
    --image_size 256 --resize_mode center_crop --skip_bbox_blurring --no_resize_only_if_bigger \
    --encode_format webp --output_format tfrecord

Alternatively, you can use your own dataset. In that case you should use img2dataset with output_format="tfrecord".

Train a model

Use training/train.py to train a model:

Here is an example command to train a model on a TPU v3-8:

python train.py \
    --assert_TPU_available \
    --config_name ../configs/small-patch16.json --dtype float32 \
    --do_train --train_folder gs://my_bucket/datacomp/small/shards \
    --output_dir gs://my_bucket/clip_model/$(date +"%Y%m%d%H%M%S") \
    --num_train_epochs 10 \
    --tokenizer_name openai/clip-vit-base-patch32 \
    --batch_size_per_node 4096 --gradient_accumulation_steps 1 \
    --learning_rate 0.00001 --warmup_steps 2000 --lr_offset 0 \
    --optim distributed_shampoo --beta1 0.9 --beta2 0.99 --weight_decay 0.0 \
    --block_size_text 512 --block_size_vision 512 --nesterov \
    --graft_type rmsprop_normalized --preconditioning_compute_steps 20 \
    --mp_devices 1 --shard_shampoo_across 2d \
    --activation_partitioning_dims 1 --parameter_partitioning_dims 1 \
    --loss_type sigmoid \
    --gradient_checkpointing \
    --unroll 100 \
    --logging_steps 100 --save_steps 5000

Acknowledgements

Lucas Beyer for helping with clarifications on the Sigmoid Loss for Language Image Pre-Training paper and Image Captioners Are Scalable Vision Learners Too
Timothée Darcet for helping with clarifications on the Vision Transformers Need Registers paper
🤗 Hugging Face for reference implementation of CLIP
Google TPU Research Cloud (TRC) program for providing computing resources
Weights & Biases for providing the infrastructure for experiment tracking and model management
Big Vision Github Repository for reference code of many papers

Citations

@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision},
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{zhai2022scaling,
      title={Scaling Vision Transformers}, 
      author={Xiaohua Zhai and Alexander Kolesnikov and Neil Houlsby and Lucas Beyer},
      year={2022},
      eprint={2106.04560},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{tschannen2023image,
      title={Image Captioners Are Scalable Vision Learners Too}, 
      author={Michael Tschannen and Manoj Kumar and Andreas Steiner and Xiaohua Zhai and Neil Houlsby and Lucas Beyer},
      year={2023},
      eprint={2306.07915},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{darcet2023vision,
      title={Vision Transformers Need Registers}, 
      author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2023},
      eprint={2309.16588},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{dehghani2023patch,
      title={Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution}, 
      author={Mostafa Dehghani and Basil Mustafa and Josip Djolonga and Jonathan Heek and Matthias Minderer and Mathilde Caron and Andreas Steiner and Joan Puigcerver and Robert Geirhos and Ibrahim Alabdulmohsin and Avital Oliver and Piotr Padlewski and Alexey Gritsenko and Mario Lučić and Neil Houlsby},
      year={2023},
      eprint={2307.06304},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{mckinzie2024mm1,
      title={MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training}, 
      author={Brandon McKinzie and Zhe Gan and Jean-Philippe Fauconnier and Sam Dodge and Bowen Zhang and Philipp Dufter and Dhruti Shah and Xianzhi Du and Futang Peng and Floris Weers and Anton Belyi and Haotian Zhang and Karanjeet Singh and Doug Kang and Ankur Jain and Hongyu Hè and Max Schwarzer and Tom Gunter and Xiang Kong and Aonan Zhang and Jianyu Wang and Chong Wang and Nan Du and Tao Lei and Sam Wiseman and Guoli Yin and Mark Lee and Zirui Wang and Ruoming Pang and Peter Grasch and Alexander Toshev and Yinfei Yang},
      year={2024},
      eprint={2403.09611},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{hsieh2023sugarcrepefixinghackablebenchmarks,
      title={SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality}, 
      author={Cheng-Yu Hsieh and Jieyu Zhang and Zixian Ma and Aniruddha Kembhavi and Ranjay Krishna},
      year={2023},
      eprint={2306.14610},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2306.14610}, 
}

Project details

Release history Release notifications | RSS feed

This version

0.0.5

Jul 4, 2024

0.0.4

Jul 4, 2024

0.0.2

Jul 15, 2023

0.0.1.post1

Jul 29, 2023

0.0.1

Feb 4, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

clip_jax-0.0.5.tar.gz (72.8 kB view details)

Uploaded Jul 4, 2024 Source

Built Distribution

clip_jax-0.0.5-py2.py3-none-any.whl (99.2 kB view details)

Uploaded Jul 4, 2024 Python 2Python 3

File details

Details for the file clip_jax-0.0.5.tar.gz.

File metadata

Download URL: clip_jax-0.0.5.tar.gz
Upload date: Jul 4, 2024
Size: 72.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for clip_jax-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`375cda5f9fdbbc511cafd9ea9327fec6a9899258d86a7906578d832237cdde21`
MD5	`d9e71d63733662309b988bf45815ee02`
BLAKE2b-256	`41b5e6036c55e1f1e95283213c33a1d9ee4faa8473656fdbb1216e88b0f0231b`

See more details on using hashes here.

File details

Details for the file clip_jax-0.0.5-py2.py3-none-any.whl.

File metadata

Download URL: clip_jax-0.0.5-py2.py3-none-any.whl
Upload date: Jul 4, 2024
Size: 99.2 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for clip_jax-0.0.5-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`464eeb9b41e3452b1cb51657dd39369dae0c0dfbee15780290005a6a77b29995`
MD5	`3d895f5cb2a752a57b3cf5a956dd13ce`
BLAKE2b-256	`9fdf75a7c2b15be0a9ed1765cb041c69c343ab3aef3a3860f1e1af38800a7ea1`

See more details on using hashes here.

clip-jax 0.0.5

Navigation

Verified details

Maintainers

Unverified details

Project description

CLIP-JAX

Installation

Usage

Use a trained model

Download training data

Train a model

Acknowledgements

Citations

Project details

Verified details

Maintainers

Unverified details

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes