
LMMs-Engine

A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.


Quick Start · Examples · Model Support · Optimizations · Architecture · Documentation


Overview

LMMs Engine is a highly efficient, modular framework for training unified multimodal models at scale.

Train a wide range of multimodal architectures, including language models (the Qwen series), vision-language models (Qwen2.5/3-VL, LLaVA-OV), diffusion models (dLLM, the WanVideo series), unified multimodal models (Qwen2.5-Omni, BAGEL), and specialized research architectures (RAE, Linear Attention, SiT).

Built with distributed-training optimizations (FSDP2 multi-dimensional parallelism, Ulysses sequence parallelism, Flash Attention, Liger Kernel, the Muon optimizer, Native Sparse Attention) and a modular design for easy extensibility.

Efficiency Report

[TBD] We will soon report MFU metrics for the following models.

  1. Qwen LLM series
  2. Qwen VLM series
  3. BAGEL
  4. dLLM

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/LMMs-Lab/lmms-engine.git
cd lmms-engine

# Install dependencies
uv sync

# Optional: Performance optimizations
uv pip install flash-attn --no-build-isolation
uv pip install liger-kernel

Launch Training

Recommended: torchrun (native PyTorch)

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
  --master_addr=127.0.0.1 --master_port=12355 \
  -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

Alternative: Accelerate

accelerate launch --use_fsdp \
  -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

Single GPU

python -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

🔥 Featured Examples

Model | Architecture | Highlights | Quick Start
BAGEL | Vision+Generation | Unified visual understanding & generation | run.sh
dLLM (Qwen3) | Diffusion LM | Masked diffusion language model | run.sh
Qwen2.5-Omni | Vision+Audio+Text | Unified multimodal (image, audio, text) | run.sh
Qwen2.5 | Text | Large language model | run.sh
Qwen2.5-VL | Vision+Text | Multimodal model | run.sh
Qwen3-VL | Vision-Language | Native-resolution, long context (10K+ tokens) | run.sh
RAE-SigLip | Visual AutoEncoder | Representation AutoEncoder, LPIPS, EMA | run.sh
SiT | Diffusion Transformer | Interpolant transformer, CFG, ImageNet-1K | run.sh
WanVideo | Video Generation | T2V/I2V/V2V generation (1.3B/14B) | run.sh
FLA models | Linear Attn Models | Efficient architecture, FineWeb-Edu pretraining | run.sh

Per-example support for FSDP2, Ulysses SP, Muon, Packing, and NSA varies; see each run.sh.

Optimization Legend:

  • FSDP2: Fully Sharded Data Parallel v2 for distributed training
  • Ulysses SP: Sequence Parallel for long contexts
  • Muon: Advanced optimizer with Newton-Schulz orthogonalization
  • Packing: First-fit bin packing, reaching 35-40% MFU vs. 20-25% without it (Qwen2.5-VL fine-tuning)
  • NSA: Native Sparse Attention for efficient long-context processing

💡 Tip: Each run.sh file contains detailed setup instructions, prerequisites, and configuration options.

🤖 Model Support

19+ architectures spanning vision-language, diffusion, and language models.

Multimodal Models

  • Qwen2.5-VL - Vision-language model with SOTA-level performance
  • Qwen3-VL - Vision-language model with SOTA-level performance
  • Qwen2.5-Omni - Unified vision + audio + text modalities
  • LLaVA-OneVision - Fully open-source vision-language model
  • BAGEL - Unified multimodal model for visual understanding and generation
  • Aero - Lightweight audio-language model

Diffusion & Generative Models

  • dLLM (Qwen3) - Diffusion Language Model with masked prediction
  • WanVideo (1.3B/14B) - Text/Image-to-Video generation (T2V/I2V/V2V)
  • SiT (XL/2) - Scalable Interpolant Transformers for class-conditional image generation
  • RAE-SigLip - Representation AutoEncoder with adversarial discriminator

Language Models

  • Qwen2/2.5/3 series - Full Liger kernel support with fused operations
  • Linear Attention Models - Recurrent architectures optimized for Muon; requires installing FLA first.
  • Custom architectures - Extensible via @register_model() decorator
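
For illustration, registering a custom architecture might look like the sketch below. This assumes @register_model() follows the same factory pattern as the @register_dataset() decorator shown under Codebase Architecture; the import path is an assumption, not the confirmed API.

# Hypothetical sketch: the import path and decorator semantics are assumed
# to mirror the @register_dataset() pattern shown later in this README.
import torch.nn as nn
from lmms_engine.models import register_model  # assumed import path

@register_model("my_custom_lm")
class MyCustomLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Build embeddings, transformer blocks, and output head here

    def forward(self, input_ids, attention_mask=None, labels=None):
        # Return whatever the chosen trainer expects (e.g. loss and logits)
        raise NotImplementedError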

⚡️ Optimizations

Production-grade efficiency from distributed training to kernel fusion.

Core Distributed Training

  • FSDP2 - PyTorch 2.0+ DTensor-based sharding for parameters, gradients, and optimizer states. Its improved composability over the original FSDP enables flexible combinations of parallelism strategies.

  • Ulysses Sequence Parallel - Splits sequence dimension across GPUs for ultra-long contexts. Critical for vision-language models like Qwen3-VL with 10K+ visual tokens.

  • Multi-dimensional Parallelism (TODO) - Compose TP × CP × PP × DP meshes for cluster-scale training.

Memory & Compute Optimizations

  • Flash Attention + Unpadding - Tiled attention with use_rmpad eliminates all padding computation, yielding a 2-3× speedup on variable-length sequences (see the sketch after this list).

  • Native Sparse Attention (NSA) - Hybrid attention mechanism combining compressed attention, top-k sparse attention, and sliding-window attention. Enables efficient long-context processing for the BAGEL model with a reduced memory footprint.

  • Liger Kernel - Triton fused kernels (CrossEntropy, RMSNorm, RoPE, SwiGLU) achieve 30% memory reduction by avoiding intermediate materializations.

  • Monkey Patching System - Runtime kernel injection via lmms_engine/configs/monkey_patch/ for model-specific optimizations without code modification.

  • Sequence Packing - First-fit bin packing achieves 35-40% MFU vs 20-25% without packing. Combined with unpadding for zero padding waste.
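
To make the unpadding idea concrete, here is a minimal, self-contained sketch in plain PyTorch (not the engine's internal use_rmpad code): it flattens a padded batch into one packed sequence and builds the cu_seqlens offsets that flash-attn's variable-length kernels consume.

import torch

# Minimal sketch of unpadding: flatten a padded batch into one packed
# sequence plus cumulative-length offsets (cu_seqlens). This is the input
# format flash-attn's varlen kernels expect; the engine's own use_rmpad
# path is more involved.
def unpad(hidden_states, attention_mask):
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)  # per-sample lengths
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    packed = hidden_states.reshape(-1, hidden_states.shape[-1])[indices]
    cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
    return packed, cu_seqlens, indices

mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
x = torch.randn(2, 4, 8)
packed, cu_seqlens, _ = unpad(x, mask)
print(packed.shape, cu_seqlens)  # torch.Size([5, 8]) tensor([0, 3, 5], dtype=torch.int32)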

Advanced Optimizer

  • Muon Optimizer - Newton-Schulz orthogonalization with Triton kernels, distributed via DTensor. Applied selectively to 2D parameters, it converges faster than AdamW.

Data Pipeline

  • Streaming Datasets - IterableDataset for trillion-token pretraining without full data loading.
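
The pattern, roughly (illustrative only; the engine's actual streaming datasets live under lmms_engine.datasets and handle sharding and resumption themselves): an IterableDataset yields records lazily instead of indexing a fully materialized corpus.

import json
from torch.utils.data import IterableDataset, get_worker_info

# Illustrative sketch: a shard-aware IterableDataset that streams JSONL
# records lazily, so the corpus never has to fit in memory.
class JsonlStream(IterableDataset):
    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        info = get_worker_info()
        # Give each dataloader worker a disjoint subset of shards
        paths = self.paths if info is None else self.paths[info.id::info.num_workers]
        for path in paths:
            with open(path) as f:
                for line in f:
                    yield json.loads(line)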

Configuration Examples

Sequence Packing - with full unpadding
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 32000

trainer_args:
  use_rmpad: true  # Requires flash-attn
  use_liger_kernel: true
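
For intuition, first_fit places each sample into the first bin that still has room, with packing_length as the bin capacity. A standalone sketch (not the engine's implementation):

# Standalone sketch of first-fit bin packing over sample lengths.
# Each bin holds samples whose total length stays under `packing_length`;
# packed bins are then concatenated into single unpadded sequences.
def first_fit_pack(lengths, packing_length=32000):
    bins, loads = [], []
    for idx, length in enumerate(lengths):
        for b, load in enumerate(loads):
            if load + length <= packing_length:
                bins[b].append(idx)
                loads[b] += length
                break
        else:
            bins.append([idx])
            loads.append(length)
    return bins

print(first_fit_pack([9000, 25000, 6000, 15000], packing_length=32000))
# [[0, 2, 3], [1]] -> two packed sequences instead of four padded ones
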
Liger Kernel - Enable LinkedIn's Triton kernels for 30% memory reduction
trainer_args:
  use_liger_kernel: true

Fused operations:

  • CrossEntropy (major memory savings)
  • RMSNorm, RoPE, SwiGLU
  • Automatically applied via monkey patching
Muon Optimizer - State-of-the-art optimizer for LLMs
trainer_args:
  use_muon: true          # enable the Muon-with-Adam optimizer
  adam_beta1: 0.9         # for the Adam part of the optimizer
  adam_beta2: 0.999       # for the Adam part of the optimizer
  adam_epsilon: 1.0e-8    # for the Adam part of the optimizer
  learning_rate: 0.001
  weight_decay: 0.01
  # ns_steps: 5  # Newton-Schulz iterations (default)
  # To control which modules use Muon vs. Adam, see the note below.

Features:

  • Newton-Schulz orthogonalization with Triton kernels
  • Distributed via DTensor (FSDP2)
  • Selective 2D parameter application

Note If users wish to specify whether a module should be optimized using Muon or Adam, they can designate this in lmms_engine.train.hf.trainer.create_optimizer. By default, modules excluded from Muon optimization include those containing the following substrings in their names: ["emb", "norm", "lm_head", "bias", "wte", "wpe", "output", "a_proj", "b_proj", "conv1d", "rotary"] as well as any parameters whose dimension does not equal 2.
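
A sketch of that default split (the exclusion list and the ndim rule are taken from the note above; the authoritative logic is create_optimizer in lmms_engine.train.hf.trainer):

# Sketch of the default Muon/Adam split described above: parameters whose
# names contain an excluded substring, or whose ndim != 2, fall back to Adam.
EXCLUDED = ["emb", "norm", "lm_head", "bias", "wte", "wpe",
            "output", "a_proj", "b_proj", "conv1d", "rotary"]

def split_param_groups(model):
    muon_params, adam_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim != 2 or any(key in name for key in EXCLUDED):
            adam_params.append(param)
        else:
            muon_params.append(param)
    return muon_params, adam_params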

FSDP2 Configuration
trainer_args:
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen2VLDecoderLayer"]
    reshard_after_forward: false
    activation_checkpointing: true
Ulysses Sequence Parallel - For long-sequence VLMs
trainer_args:
  sp_ulysses_degree: 2  # Sequence parallel degree

Benefits:

  • Splits sequence length across GPUs
  • Reduces memory footprint for long contexts
  • Works with Flash Attention
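
In isolation, the sequence split itself is simple. The sketch below shows only the sharding step and omits the all-to-all exchange Ulysses performs inside attention; sp_rank and sp_degree would come from the sequence-parallel process group.

import torch

# Toy illustration of the Ulysses sequence split (the in-attention
# all-to-all exchange is omitted).
def shard_sequence(hidden_states, sp_rank, sp_degree):
    # hidden_states: (batch, seq_len, hidden); seq_len divisible by sp_degree
    return hidden_states.chunk(sp_degree, dim=1)[sp_rank]

x = torch.randn(1, 8192, 4096)
local = shard_sequence(x, sp_rank=0, sp_degree=2)
print(local.shape)  # torch.Size([1, 4096, 4096])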
Native Sparse Attention (NSA) - Efficient long-context attention for BAGEL
model_config:
  load_from_pretrained_path: "lmms-lab/BAGEL-7B-MoT-ver.LE"

monkey_patch:
  - type: nsa
    model_type: bagel
    kwargs:
      block_size: 64
      compress_type: "weightedpool"  # weightedpool, linear, avgpool
      kernel_size: 32
      kernel_stride: 16
      topk: 16
      init_blocks: 1
      local_blocks: 2
      window_size: 512

Features:

  • Compressed attention with key-value compression
  • TopK sparse attention for efficiency
  • Sliding window attention for local context
  • Hybrid mechanism combines all three attention types
  • Requires: pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git

Note: Currently only supported for BAGEL model.

📖 Documentation

Step-by-Step Workflow

  1. Process the dataset into OpenAI chat format (JSONL/JSON/Arrow/CSV); an example record is shown after this list

    hf download kcz358/open-thoughts-debug --local-dir data/open_thoughts_debug --repo-type dataset
    
  2. Prepare dataset YAML (optional for single data source)

    datasets:
      - path: data/open_thoughts_debug
        data_folder: ""
        data_type: arrow
    
  3. Configure training - See examples/qwen3_vl/example_config.yaml or any model-specific config in examples/
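
For reference, an OpenAI-chat-format record typically looks like the following (one JSONL line, shown as a Python dict; the multimodal content schema here is illustrative, and the exact fields the engine expects may differ):

# Illustrative OpenAI-style chat record. The image entry's exact schema
# is an assumption, not the engine's confirmed format.
record = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image_url": "images/0001.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]},
        {"role": "assistant", "content": "A golden retriever playing in the snow."},
    ]
}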


🏗️ Codebase Architecture

Component Registry

Factory Pattern enables easy extensibility:

# Register a custom dataset
from lmms_engine.datasets import register_dataset, BaseDataset

@register_dataset("my_custom_dataset")
class MyCustomDataset(BaseDataset):
    def __init__(self, config):
        super().__init__(config)
        # Custom initialization

    def __getitem__(self, idx):
        # Custom data loading
        return item

# Register a custom processor
from lmms_engine.datasets.processor import register_processor

@register_processor("my_custom_processor")
class MyCustomProcessor:
    def __call__(self, raw_data):
        # Custom processing
        return processed_data

Training Pipeline

Builder Pattern for flexible composition:

from lmms_engine.train import TrainRunner

# Configuration defines the pipeline
runner = TrainRunner(config)
runner.build()  # Lazy initialization of components
runner.run()    # Execute training

Pipeline stages:

  1. Model initialization - From pretrained or config
  2. Dataset creation - With processor and collator
  3. Monkey patching - Apply kernel optimizations
  4. Trainer setup - FSDP2, DeepSpeed, or custom
  5. Training execution - With checkpointing and logging

Supported Trainers

Trainer Type | Use Case | Key Features
hf_trainer | General VLM/LM training | FSDP2, Muon, Liger, Flash Attn
dllm_trainer | Diffusion language models | Masked LM, custom loss, DLLM collator
wan_trainer | Video generation | Flow-matching, multi-modal inputs
rae_trainer | Visual autoencoders | Adversarial loss, EMA, LPIPS
sit_trainer | Diffusion transformers | Interpolant framework, CFG, EMA

🎯 Use Cases

  • Vision-Language Pretraining - Qwen-VL, LLaVA on large multimodal datasets
  • Video Understanding - AERO on 3D video data
  • Diffusion Models - DLLM, SiT, WanVideo for generation tasks
  • Representation Learning - RAE for visual representations
  • Language Model Pretraining - DGN, Qwen with Muon optimizer
  • Multimodal Fine-tuning - Efficient SFT with sequence packing

🤝 Contributing

We welcome contributions! Please see our Design Principles for coding guidelines:

  • Simplicity: Write simple, straightforward code
  • Readability: Prioritize clarity over cleverness
  • Testability: Create testable components
  • Minimal Changes: Only modify code related to the task
  • Less Code = Less Debt: Minimize code footprint

📝 Citation

If you use LMMs Engine in your research, please cite:

@software{lmms_engine2024,
  title={LMMs Engine: A simple, unified multimodal framework for pretraining and finetuning.},
  author={LMMs-Lab},
  year={2024},
  url={https://github.com/LMMs-Lab/lmms-engine}
}

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Built with ❤️ by LMMs-Lab

Star us on GitHub to support the project!
