small vlm for training and experiments
small-vlm
A flexible and configurable Vision Language Model (VLM) framework built with PyTorch, designed for experimentation and ease of use. This framework allows for modular replacement of core components and fine-grained control over training parameters.
Features
- Modular Design: Easily swap out the Language Model (LLM), Visual Encoder, and Connector components to experiment with different architectures.
- Configuration Management: Utilizes Hydra for robust and flexible configuration management, allowing you to define and override parameters easily.
- Environment Setup: Uses uv for fast and reliable Python environment and package management.
- Granular Training Control:
  - Independently set learning rates and weight decay for the LLM, visual encoder, and connector.
  - Independently freeze or unfreeze these components during different training stages.
- LLaVA Implementation: Includes a straightforward reproduction of the LLaVA model (pretraining and finetuning).
- Hugging Face Hub Integration:
  - Easily push your trained models to the Hugging Face Hub using a simple script.
  - Load models pushed to the Hub using the standard AutoModel and AutoProcessor classes from the transformers library.
Architecture
The VLM consists of three main components:
- Visual Encoder: Extracts visual features from images. Supports various vision transformers (e.g., CLIP). Configurable via model.visual_encoder in Hydra configs.
- Language Model: Processes text and generates responses. Supports various Hugging Face language models. Configurable via model.language_model in Hydra configs.
- Connector: Bridges the visual and language modalities. Supports different projection mechanisms (e.g., MLP). Configurable via model.connector in Hydra configs.
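The data flow between the three components can be pictured with a minimal, dependency-free sketch. It is not the framework's code: the dimensions, the helper names (`init_linear`, `mlp_connector`, `build_lm_inputs`), and the choice to prepend image tokens to the text embeddings are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Create a (weight, bias) pair for a toy linear layer (hypothetical helper)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def mlp_connector(patch_features, fc1, fc2):
    """Project each visual patch feature into the language model's embedding size."""
    return [linear([gelu(h) for h in linear(p, fc1)], fc2) for p in patch_features]


def build_lm_inputs(image_tokens, text_embeddings):
    """Simplification: place the projected image tokens before the text tokens."""
    return image_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, LM_DIM = 8, 16          # assumed toy sizes
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3  # assumed toy sequence lengths

# Stand-ins for the visual encoder output and the LM's text embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

fc1 = init_linear(VISION_DIM, LM_DIM, rng)
fc2 = init_linear(LM_DIM, LM_DIM, rng)

image_tokens = mlp_connector(patch_features, fc1, fc2)
lm_inputs = build_lm_inputs(image_tokens, text_embeddings)
print(len(lm_inputs), len(lm_inputs[0]))  # sequence length, embedding size
```

The point of the sketch is only the shape contract: the connector maps vectors of the visual encoder's width to vectors of the language model's width, so that image and text tokens can share one input sequence.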
Setup and Installation
- Environment Setup with uv: This project uses uv for Python environment and dependency management. For instructions on installing uv and setting up Python, please refer to installation.md.
- Install Dependencies: Once uv is installed and you have cloned the repository, install the necessary dependencies:
make install
Training
Training is managed via Hydra configurations and executed using DeepSpeed.
LLaVA Pretraining
To pretrain the LLaVA model, run:
deepspeed --module vlm -cn pretrain-llava
Customization
You can customize various aspects of the model and training process through Hydra configurations located in src/vlm/config/. This includes:
- Model Components:
  - model.visual_encoder.hf_name: Hugging Face name of the visual encoder.
  - model.language_model.hf_name: Hugging Face name of the language model.
  - model.connector.name and model.connector.type: Define the type and specifics of the connector module.
- Training Parameters per Component:
  - trainer.unfreeze: Booleans train_vision_model, train_language_model, train_connector to control which parts are trainable.
  - trainer.learning_rate: Specific learning rates like visual_encoder_learning_rate, language_model_learning_rate, connector_learning_rate.
  - trainer.weight_decay: Specific weight decays like visual_encoder_weight_decay, language_model_weight_decay, connector_weight_decay.
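One way to picture how per-component freezing, learning rates, and weight decay could translate into optimizer settings is the sketch below. It is an assumption about the general technique, not the framework's implementation: `ComponentSettings`, `build_param_groups`, the parameter names, and all numeric values are hypothetical, and strings stand in for real parameters. The returned list of dicts uses the `params` / `lr` / `weight_decay` keys that PyTorch optimizers accept as per-parameter options.

```python
from dataclasses import dataclass


@dataclass
class ComponentSettings:
    """Hypothetical per-component training settings."""
    trainable: bool
    learning_rate: float
    weight_decay: float


def build_param_groups(named_params, settings):
    """Group parameters by component name prefix and skip frozen components."""
    buckets = {prefix: [] for prefix in settings}
    for name, param in named_params:
        for prefix in settings:
            if name.startswith(prefix + "."):
                buckets[prefix].append(param)
                break
    return [
        {"params": buckets[prefix], "lr": cfg.learning_rate, "weight_decay": cfg.weight_decay}
        for prefix, cfg in settings.items()
        if cfg.trainable and buckets[prefix]
    ]


# Illustrative values only: train the connector, freeze everything else.
settings = {
    "visual_encoder": ComponentSettings(trainable=False, learning_rate=0.0, weight_decay=0.0),
    "language_model": ComponentSettings(trainable=False, learning_rate=2e-5, weight_decay=0.0),
    "connector": ComponentSettings(trainable=True, learning_rate=1e-3, weight_decay=0.0),
}

# Placeholder (name, parameter) pairs; a real model would supply named_parameters().
named_params = [
    ("visual_encoder.patch_embed.weight", "p0"),
    ("language_model.embed_tokens.weight", "p1"),
    ("connector.fc1.weight", "p2"),
    ("connector.fc2.weight", "p3"),
]

param_groups = build_param_groups(named_params, settings)
print(param_groups)
```

With a real PyTorch model, frozen components would typically also have `requires_grad` set to `False` on their parameters so that no gradients are computed for them.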
For example, to change the learning rate for the language model during finetuning, you could modify src/vlm/config/trainer/learning_rate/llava-finetune.yaml or override it via the command line:
deepspeed --module vlm -cn finetune-llava trainer.learning_rate.language_model_learning_rate=5e-6
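To make the dotted override syntax concrete, the following sketch shows how a `key.path=value` string maps onto a nested configuration. It is a simplified illustration and not Hydra's actual parser (Hydra has its own override grammar); `parse_value`, `apply_override`, and the starting values in `config` are hypothetical.

```python
def parse_value(text):
    """Convert an override value to bool, int, or float when possible."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_override(config, override):
    """Apply one 'a.b.c=value' override to a nested dict, in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = parse_value(raw_value)
    return config


# Assumed starting values, for illustration only.
config = {
    "trainer": {
        "learning_rate": {
            "language_model_learning_rate": 2e-5,
            "connector_learning_rate": 2e-5,
        }
    }
}

apply_override(config, "trainer.learning_rate.language_model_learning_rate=5e-6")
print(config["trainer"]["learning_rate"])
```

Only the addressed leaf changes; sibling keys such as the connector learning rate keep their configured values.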
Inference
For inference, refer to src/vlm/inference/eval.py.
LLaVA Reproduction Results (Using lmms-eval)
| Task | Metric | Reproduced LLaVA (Value ± Stderr) | Original LLaVA (Value ± Stderr) |
|---|---|---|---|
| gqa | exact_match | 0.6201 ± 0.0043 | 0.6192 ± 0.0043 |
| mmbench_cn_cc | gpt_eval_score | 25.2941 ± N/A | 23.5294 ± N/A |
| mmbench_cn_dev | gpt_eval_score | 54.8969 ± N/A | 55.6701 ± N/A |
| mmbench_en_dev | gpt_eval_score | 66.0653 ± N/A | 64.0893 ± N/A |
| mmbench_ru_dev | gpt_eval_score | 54.9282 ± N/A | 53.0144 ± N/A |
| mme | mme_cognition_score | 321.4286 ± N/A | 355.7143 ± N/A |
| mme | mme_perception_score | 1505.4650 ± N/A | 1509.1289 ± N/A |
| scienceqa | exact_match | 0.6977 ± 0.0071 | 0.6572 ± 0.0073 |
| seedbench | seed_image | 0.6593 ± N/A | 0.6616 ± N/A |
| textvqa_val | exact_match | 0.4902 ± 0.0068 | 0.4600 ± 0.0068 |
| mmmu_val | mmmu_acc | 0.3789 ± N/A | 0.3611 ± N/A |
| ai2d | exact_match | 0.5379 ± 0.009 | 0.5518 ± 0.009 |
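For the four rows that report a standard error, a rough way to read the gap between the reproduced and original scores is to divide the difference by the combined standard error. The sketch below does this using only values from the table. Treating the two runs as independent is an assumption (both are evaluated on the same benchmark items), so the ratios are a rough guide rather than a formal test; the `compare` helper is hypothetical.

```python
import math

# (value, stderr) pairs copied from the table: reproduced first, original second.
rows = {
    "gqa": ((0.6201, 0.0043), (0.6192, 0.0043)),
    "scienceqa": ((0.6977, 0.0071), (0.6572, 0.0073)),
    "textvqa_val": ((0.4902, 0.0068), (0.4600, 0.0068)),
    "ai2d": ((0.5379, 0.009), (0.5518, 0.009)),
}


def compare(reproduced, original):
    """Return (difference, difference / combined standard error)."""
    (rep, rep_se), (orig, orig_se) = reproduced, original
    diff = rep - orig
    combined_se = math.sqrt(rep_se**2 + orig_se**2)
    return diff, diff / combined_se


results = {task: compare(*pair) for task, pair in rows.items()}
for task, (diff, ratio) in results.items():
    print(f"{task:12s} diff={diff:+.4f} ratio={ratio:+.2f}")
```

Under that independence assumption, the gqa and ai2d differences stay below two combined standard errors, while the scienceqa and textvqa_val gains exceed three.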
Pushing Models to Hugging Face Hub
This project provides a script to easily upload your trained models and processors to the Hugging Face Hub.
- Run the push script: Execute the push-to-hub command (which calls the push_vlm_to_hub function):
push-to-hub
The script will interactively ask for:
  - Path to your pretrained/finetuned model checkpoint directory.
  - The desired repository name on the Hub (e.g., your-username/your-model-name).
  - Whether to force push if the repository already exists.
- Loading from Hub: Once pushed, your model can be loaded by anyone using the standard transformers library:

```python
from transformers import AutoModel, AutoProcessor

repo_id = "your-username/your-model-name"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
# ... proceed with inference
```

The push-to-hub script automatically prepares the necessary configuration files (modeling_vlm.py, processing_vlm.py, configuration_vlm.py, connectors.py) and updates config.json and processor_config.json to enable this seamless loading.
This project was built from simple-modern-uv, LLaVA, and LLaVA-NEXT.