Skip to main content

small vlm for training and experiments

Project description

small-vlm

Architecture

A flexible and configurable Vision Language Model (VLM) framework built with PyTorch, designed for experimentation and ease of use. This framework allows for modular replacement of core components and fine-grained control over training parameters.

Features

  • Modular Design: Easily swap out the Language Model (LLM), Visual Encoder, and Connector components to experiment with different architectures.
  • Configuration Management: Utilizes Hydra for robust and flexible configuration management, allowing you to define and override parameters easily.
  • Environment Setup: Uses uv for fast and reliable Python environment and package management.
  • Granular Training Control:
    • Independently set learning rates and weight decay for the LLM, visual encoder, and connector.
    • Independently freeze or unfreeze these components during different training stages.
  • LLaVA Implementation: Includes a straightforward reproduction of the LLaVA model (pretraining and finetuning).
  • Hugging Face Hub Integration:
    • Easily push your trained models to the Hugging Face Hub using a simple script.
    • Load models pushed to the Hub using the standard AutoModel and AutoProcessor classes from the transformers library.

Architecture

The VLM consists of three main components:

  • Visual Encoder: Extracts visual features from images. Supports various vision transformers (e.g., CLIP). Configurable via model.visual_encoder in Hydra configs.
  • Language Model: Processes text and generates responses. Supports various Hugging Face language models. Configurable via model.language_model in Hydra configs.
  • Connector: Bridges the visual and language modalities. Supports different projection mechanisms (e.g., MLP). Configurable via model.connector in Hydra configs.

Setup and Installation

  1. Environment Setup with uv: This project uses uv for Python environment and dependency management. For instructions on installing uv and setting up Python, please refer to installation.md.

  2. Install Dependencies: Once uv is installed and you have cloned the repository, install the necessary dependencies:

    make install
    

Training

Training is managed via Hydra configurations and executed using DeepSpeed.

LLaVA Pretraining

To pretrain the LLaVA model, run:

deepspeed --module vlm -cn pretrain-llava

Customization

You can customize various aspects of the model and training process through Hydra configurations located in src/vlm/config/. This includes:

  • Model Components:
    • model.visual_encoder.hf_name: Hugging Face name of the visual encoder.
    • model.language_model.hf_name: Hugging Face name of the language model.
    • model.connector.name and model.connector.type: Define the type and specifics of the connector module.
  • Training Parameters per Component:
    • trainer.unfreeze: Booleans train_vision_model, train_language_model, train_connector to control which parts are trainable.
    • trainer.learning_rate: Specific learning rates like visual_encoder_learning_rate, language_model_learning_rate, connector_learning_rate.
    • trainer.weight_decay: Specific weight decays like visual_encoder_weight_decay, language_model_weight_decay, connector_weight_decay.

For example, to change the learning rate for the language model during finetuning, you could modify src/vlm/config/trainer/learning_rate/llava-finetune.yaml or override it via the command line:

deepspeed --module vlm -cn finetune-llava trainer.learning_rate.language_model_learning_rate=5e-6

Inference

You can refer to src/vlm/inference/eval.py

LLaVA Reproduction Results (Using lmms-eval)

Task Metric Reproduced LLaVA (Value ± Stderr) Original LLaVA (Value ± Stderr)
gqa exact_match 0.6201 ± 0.0043 0.6192 ± 0.0043
mmbench_cn_cc gpt_eval_score 25.2941 ± N/A 23.5294 ± N/A
mmbench_cn_dev gpt_eval_score 54.8969 ± N/A 55.6701 ± N/A
mmbench_en_dev gpt_eval_score 66.0653 ± N/A 64.0893 ± N/A
mmbench_ru_dev gpt_eval_score 54.9282 ± N/A 53.0144 ± N/A
mme mme_cognition_score 321.4286 ± N/A 355.7143 ± N/A
mme mme_perception_score 1505.4650 ± N/A 1509.1289 ± N/A
scienceqa exact_match 0.6977 ± 0.0071 0.6572 ± 0.0073
seedbench seed_image 0.6593 ± N/A 0.6616 ± N/A
textvqa_val exact_match 0.4902 ± 0.0068 0.4600 ± 0.0068
mmmu_val mmmu_acc 0.3789 ± N/A 0.3611 ± N/A
ai2d exact_match 0.5379 ± 0.009 0.5518 ± 0.009

Pushing Models to Hugging Face Hub

This project provides a script to easily upload your trained models and processors to the Hugging Face Hub.

  1. Run the push script: Execute the push-to-hub command (which calls the push_vlm_to_hub function):

    push-to-hub
    

    The script will interactively ask for:

    • Path to your pretrained/finetuned model checkpoint directory.
    • The desired repository name on the Hub (e.g., your-username/your-model-name).
    • Whether to force push if the repository already exists.
  2. Loading from Hub: Once pushed, your model can be loaded by anyone using the standard transformers library:

    from transformers import AutoModel, AutoProcessor
    
    repo_id = "your-username/your-model-name"
    model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
    
    # ... proceed with inference
    

    The push-to-hub script automatically prepares the necessary configuration files (modeling_vlm.py, processing_vlm.py, configuration_vlm.py, connectors.py) and updates config.json and processor_config.json to enable this seamless loading.


This project was built from simple-modern-uv, LLaVA, LLaVA-NEXT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

small_vlm-0.9.2.tar.gz (374.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

small_vlm-0.9.2-py3-none-any.whl (66.5 kB view details)

Uploaded Python 3

File details

Details for the file small_vlm-0.9.2.tar.gz.

File metadata

  • Download URL: small_vlm-0.9.2.tar.gz
  • Upload date:
  • Size: 374.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for small_vlm-0.9.2.tar.gz
Algorithm Hash digest
SHA256 e1b6d4e50b3c17c7faf169a131f65fb59f123faf94aa4203ba3b632e061a0338
MD5 d8680b3a4edbe9f89528806eedecf963
BLAKE2b-256 8274f6d0a8becb1b67429c8c2e570e1cc55bd237a425d5757f79f088c664a43e

See more details on using hashes here.

Provenance

The following attestation bundles were made for small_vlm-0.9.2.tar.gz:

Publisher: publish.yml on leo1oel/small-vlm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file small_vlm-0.9.2-py3-none-any.whl.

File metadata

  • Download URL: small_vlm-0.9.2-py3-none-any.whl
  • Upload date:
  • Size: 66.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for small_vlm-0.9.2-py3-none-any.whl
Algorithm Hash digest
SHA256 1bf7318db7f270836f480c90ef2f606047e86689cb90b898a7f7c14c572df852
MD5 359150df860e465306c98606ade6c439
BLAKE2b-256 24b67c1c1ad91cbd22328464c37d08f985ab04b8e367958af6f45cc85412981a

See more details on using hashes here.

Provenance

The following attestation bundles were made for small_vlm-0.9.2-py3-none-any.whl:

Publisher: publish.yml on leo1oel/small-vlm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page