LMMs-Engine
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.
Quick Start • Examples • Model Support • Optimizations • Architecture • Documentation
Overview
LMMs Engine is a highly efficient, modular framework for training unified multimodal models at scale.
Train a wide range of multimodal architectures, including language models (Qwen series), vision-language models (Qwen2.5/3-VL, LLaVA-OV), diffusion models (dLLM, WanVideo series), unified multimodal models (Qwen2.5-Omni, BAGEL), and specialized research architectures (RAE, Linear Attn, SiT).
Built with distributed training optimizations (FSDP2 Multi-dimensional Parallelism, Ulysses Sequence Parallel, Flash Attention, Liger Kernel, Muon optimizer, Native Sparse Attention) and a modular design for easy extensibility.
Efficiency Report
[TBD] We will soon report MFU metrics for the following models.
- Qwen LLM series
- Qwen VLM series
- BAGEL
- dLLM
🚀 Quick Start
Installation
# Clone the repository
git clone https://github.com/LMMs-Lab/lmms-engine.git
cd lmms-engine
# Install dependencies
uv sync
# Optional: Performance optimizations
uv pip install flash-attn --no-build-isolation
uv pip install liger-kernel
Launch Training
Recommended: torchrun (native PyTorch)
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
--master_addr=127.0.0.1 --master_port=12355 \
-m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml
Alternative: Accelerate
accelerate launch --use_fsdp \
-m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml
Single GPU
python -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml
🔥 Featured Examples
| Model | Architecture | FSDP2 | Ulysses SP | Muon | Packing | NSA | Highlights | Quick Start |
|---|---|---|---|---|---|---|---|---|
| BAGEL | Vision+Generation | ✅ | TBD | ✅ | ✅ | ✅ | Unified visual understanding & generation | run.sh |
| dLLM (Qwen3) | Diffusion LM | ✅ | ❌ | ✅ | ❌ | ❌ | Masked diffusion language model | run.sh |
| Qwen2.5-Omni | Vision+Audio+Text | ✅ | ✅ | ✅ | ✅ | ❌ | Unified multimodal (image, audio, text) | run.sh |
| Qwen2.5 | Text | ✅ | ✅ | ✅ | ✅ | ❌ | Large Language Model | run.sh |
| Qwen2.5-VL | Vision+Text | ✅ | ✅ | ✅ | ✅ | ❌ | Multimodal Model | run.sh |
| Qwen3-VL | Vision-Language | ✅ | ✅ | ✅ | ✅ | ❌ | Native-resolution, long context (10K+ tokens) | run.sh |
| RAE-SigLip | Visual AutoEncoder | ✅ | ❌ | ✅ | ❌ | ❌ | Representation AutoEncoder, LPIPS, EMA | run.sh |
| SiT | Diffusion Transformer | ✅ | ❌ | ✅ | ❌ | ❌ | Interpolant Transformer, CFG, ImageNet-1K | run.sh |
| WanVideo | Video Generation | ✅ | ❌ | ✅ | ❌ | ❌ | T2V/I2V/V2V generation (1.3B/14B) | run.sh |
| FLA models | Linear Attn Models | ✅ | ❌ | ✅ | ✅ | ❌ | Efficient architecture, FineWeb-Edu pretraining | run.sh |
Optimization Legend:
- FSDP2: Fully Sharded Data Parallel v2 for distributed training
- Ulysses SP: Sequence Parallel for long contexts
- Muon: Advanced optimizer with Newton-Schulz orthogonalization
- Packing: First-fit bin packing, reaching 35-40% MFU vs. 20-25% without packing (measured on Qwen2.5-VL fine-tuning)
- NSA: Native Sparse Attention for efficient long-context processing
💡 Tip: Each `run.sh` file contains detailed setup instructions, prerequisites, and configuration options.
🤖 Model Support
19+ architectures spanning vision-language, diffusion, and language models.
Multimodal Models
- Qwen2.5-VL - SOTA-level vision-language model
- Qwen3-VL - SOTA-level vision-language model
- Qwen2.5-Omni - Unified vision + audio + text modalities
- LLaVA-OneVision - Fully open-source vision-language model
- Bagel - Unified multimodal model for visual understanding and generation
- Aero - Lightweight audio-language model
Diffusion & Generative Models
- dLLM (Qwen3) - Diffusion Language Model with masked prediction
- WanVideo (1.3B/14B) - Text/Image-to-Video generation (T2V/I2V/V2V)
- SiT (XL/2) - Scalable Interpolant Transformers for class-conditional image generation
- RAE-SigLip - Representation AutoEncoder with adversarial discriminator
Language Models
- Qwen2/2.5/3 series - Full Liger kernel support with fused operations
- Linear Attention Models - Recurrent architectures optimized for Muon; please install FLA first.
- Custom architectures - Extensible via the `@register_model()` decorator (see the sketch below)
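A minimal sketch of the registration pattern, assuming `register_model` is importable from `lmms_engine.models` and that registered models follow the standard Hugging Face `PreTrainedModel` interface; the exact import path and base-class requirements are assumptions, so check the API Reference for the real contract.

```python
# Hypothetical sketch of registering a custom architecture with the factory.
# The import path below is an assumption, not the engine's confirmed API.
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel

from lmms_engine.models import register_model  # assumed import path


class TinyLMConfig(PretrainedConfig):
    model_type = "tiny_lm"

    def __init__(self, vocab_size=32000, hidden_size=256, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size


@register_model("tiny_lm")  # name referenced from the training config
class TinyLM(PreTrainedModel):
    config_class = TinyLMConfig

    def __init__(self, config):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, labels=None, **kwargs):
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
        return {"loss": loss, "logits": logits}
```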
⚡️ Optimizations
Production-grade efficiency from distributed training to kernel fusion.
Core Distributed Training
- FSDP2 - PyTorch 2.0+ DTensor-based sharding for parameters, gradients, and optimizer states. Improved composability over the original FSDP enables flexible parallelism composition.
- Ulysses Sequence Parallel - Splits the sequence dimension across GPUs for ultra-long contexts. Critical for vision-language models like Qwen3-VL with 10K+ visual tokens.
- Multi-dimensional Parallelism (TODO) - Compose TP × CP × PP × DP meshes for cluster-scale training.
Memory & Compute Optimizations
- Flash Attention + Unpadding - Tiled attention with `use_rmpad` eliminates all padding computation. 2-3× speedup on variable-length sequences.
- Native Sparse Attention (NSA) - Hybrid attention mechanism combining compressed attention, top-k sparse attention, and sliding window attention. Enables efficient long-context processing for the BAGEL model with a reduced memory footprint.
- Liger Kernel - Triton fused kernels (CrossEntropy, RMSNorm, RoPE, SwiGLU) achieve 30% memory reduction by avoiding intermediate materializations.
- Monkey Patching System - Runtime kernel injection via `lmms_engine/configs/monkey_patch/` for model-specific optimizations without code modification.
- Sequence Packing - First-fit bin packing achieves 35-40% MFU vs 20-25% without packing. Combined with unpadding for zero padding waste (a minimal sketch of the strategy follows this list).
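The first-fit strategy itself is simple to reason about. Below is a minimal, framework-independent sketch of first-fit bin packing over tokenized sample lengths; the engine's `packing_strategy: first_fit` implementation additionally handles attention masks and position ids, so treat this as an illustration only.

```python
# Minimal first-fit sketch: greedily place each sample into the first packed
# sequence ("bin") that still has room, opening a new bin otherwise.
def first_fit_pack(sample_lengths, packing_length):
    bins = []        # each bin is a list of sample indices
    remaining = []   # tokens still available in each bin
    for idx, length in enumerate(sample_lengths):
        for b, space in enumerate(remaining):
            if length <= space:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            bins.append([idx])
            remaining.append(packing_length - length)
    return bins


# Example: pack variable-length samples into 32k-token sequences.
print(first_fit_pack([31000, 900, 15000, 16000, 2000], packing_length=32000))
# -> [[0, 1], [2, 3], [4]]  (far fewer pad tokens than one sample per sequence)
```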
Advanced Optimizer
- Muon Optimizer - Newton-Schulz orthogonalization with Triton kernels, distributed via DTensor. Selective application to 2D parameters outperforms AdamW convergence.
Data Pipeline
- Streaming Datasets - `IterableDataset` for trillion-token pretraining without full data loading (a generic sketch follows).
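The streaming idea can be sketched with plain PyTorch. The snippet below is a generic illustration (not the engine's dataset class), assuming newline-delimited JSON shards on disk; it shows why an `IterableDataset` keeps memory flat regardless of corpus size.

```python
# Generic streaming sketch (not the engine's implementation): lazily read
# JSONL shards so memory use stays constant regardless of corpus size.
import json
from pathlib import Path

from torch.utils.data import IterableDataset, get_worker_info


class JsonlStreamingDataset(IterableDataset):
    def __init__(self, shard_dir):
        self.shards = sorted(Path(shard_dir).glob("*.jsonl"))

    def __iter__(self):
        worker = get_worker_info()
        # Simple worker sharding: each DataLoader worker reads every Nth shard.
        shards = self.shards if worker is None else self.shards[worker.id::worker.num_workers]
        for shard in shards:
            with open(shard) as f:
                for line in f:
                    yield json.loads(line)
```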
Configuration Examples
Sequence Packing - with full unpadding
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 32000
trainer_args:
  use_rmpad: true          # Requires flash-attn
  use_liger_kernel: true
Liger Kernel - Enable LinkedIn's Triton kernels for 30% memory reduction
trainer_args:
  use_liger_kernel: true
Fused operations:
- CrossEntropy (major memory savings)
- RMSNorm, RoPE, SwiGLU
- Automatically applied via monkey patching
Muon Optimizer - State-of-the-art optimizer for LLMs
trainer_args:
  use_muon: true          # enable the muonwithadam optimizer
  adam_beta1: 0.9         # for the Adam part of the muonwithadam optimizer
  adam_beta2: 0.999       # for the Adam part of the muonwithadam optimizer
  adam_epsilon: 1.0e-8    # for the Adam part of the muonwithadam optimizer
  learning_rate: 0.001
  weight_decay: 0.01
  # ns_steps: 5           # Newton-Schulz iterations (default)
  # Routing specific modules to Muon vs. Adam is described in the note below.
Features:
- Newton-Schulz orthogonalization with Triton kernels (a dense-tensor sketch follows this list)
- Distributed via DTensor (FSDP2)
- Selective 2D parameter application
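For intuition, here is a minimal dense-tensor sketch of the Newton-Schulz orthogonalization that Muon applies to each 2D gradient. The quintic coefficients are taken from the commonly cited reference Muon implementation and are an assumption here; the engine's version runs fused Triton kernels over DTensor-sharded parameters instead.

```python
# Minimal Newton-Schulz sketch (coefficients assumed from the reference Muon
# implementation; the engine uses fused Triton kernels and DTensor instead).
import torch


def newton_schulz_orthogonalize(grad: torch.Tensor, ns_steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D gradient onto the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = grad / (grad.norm() + 1e-7)   # normalize so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:                    # iterate on the "wide" orientation
        x = x.T
    for _ in range(ns_steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x   # quintic Newton-Schulz update
    return x.T if transposed else x


# Usage sketch: apply to each 2D parameter's gradient before the update step.
orthogonalized = newton_schulz_orthogonalize(torch.randn(1024, 4096))
```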
Note
If users wish to specify whether a module should be optimized using Muon or Adam, they can designate this in lmms_engine.train.hf.trainer.create_optimizer. By default, modules excluded from Muon optimization include those containing the following substrings in their names: ["emb", "norm", "lm_head", "bias", "wte", "wpe", "output", "a_proj", "b_proj", "conv1d", "rotary"]
as well as any parameters whose dimension does not equal 2.
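The default routing rule is easy to express directly. The sketch below restates the selection logic from the note; names and structure are illustrative, not the actual signature of `create_optimizer`.

```python
# Illustrative sketch of the default Muon/Adam routing rule described above;
# the real logic lives in lmms_engine.train.hf.trainer.create_optimizer.
MUON_EXCLUDE_SUBSTRINGS = [
    "emb", "norm", "lm_head", "bias", "wte", "wpe", "output",
    "a_proj", "b_proj", "conv1d", "rotary",
]


def uses_muon(param_name, param):
    """Muon handles 2D weights whose names match none of the excluded substrings."""
    if param.ndim != 2:
        return False
    return not any(sub in param_name for sub in MUON_EXCLUDE_SUBSTRINGS)


def split_params(model):
    """Partition named parameters into a Muon group and an Adam group."""
    muon_params, adam_params = [], []
    for name, p in model.named_parameters():
        (muon_params if uses_muon(name, p) else adam_params).append(p)
    return muon_params, adam_params
```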
FSDP2 Configuration
trainer_args:
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen2VLDecoderLayer"]
    reshard_after_forward: false
    activation_checkpointing: true
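For orientation, the snippet below sketches roughly what an FSDP2-style setup does under the hood: per-layer `fully_shard` wrapping over DTensor. This is a generic PyTorch sketch under stated assumptions (PyTorch >= 2.6 import path), not the engine's trainer code.

```python
# Generic FSDP2 sketch (not the engine's code): shard each decoder layer,
# then the root module. Assumes PyTorch >= 2.6, where fully_shard is exposed
# under torch.distributed.fsdp; older releases keep it in a private module.
from torch.distributed.fsdp import fully_shard


def shard_model(model, layer_cls_name="Qwen2VLDecoderLayer"):
    for module in model.modules():
        if type(module).__name__ == layer_cls_name:
            fully_shard(module, reshard_after_forward=False)
    fully_shard(model)  # wrap the root last so remaining params are sharded
    return model
```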
Ulysses Sequence Parallel - For long-sequence VLMs
trainer_args:
  sp_ulysses_degree: 2   # Sequence parallel degree
Benefits:
- Splits sequence length across GPUs
- Reduces memory footprint for long contexts
- Works with Flash Attention
Native Sparse Attention (NSA) - Efficient long-context attention for BAGEL
model_config:
  load_from_pretrained_path: "lmms-lab/BAGEL-7B-MoT-ver.LE"
  monkey_patch:
    - type: nsa
      model_type: bagel
      kwargs:
        block_size: 64
        compress_type: "weightedpool"   # weightedpool, linear, avgpool
        kernel_size: 32
        kernel_stride: 16
        topk: 16
        init_blocks: 1
        local_blocks: 2
        window_size: 512
Features:
- Compressed attention with key-value compression
- TopK sparse attention for efficiency
- Sliding window attention for local context
- Hybrid mechanism combines all three attention types
- Requires: `pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git`
Note: Currently only supported for BAGEL model.
📖 Documentation
Step-by-Step Workflow
1. Process the dataset into OpenAI chat format (JSONL/JSON/Arrow/CSV) - a minimal record-writing sketch follows this list

   hf download kcz358/open-thoughts-debug --local-dir data/open_thoughts_debug --repo-type dataset

2. Prepare a dataset YAML (optional for a single data source)

   datasets:
     - path: data/open_thoughts_debug
       data_folder: ""
       data_type: arrow

3. Configure training - See examples/qwen3_vl/example_config.yaml or any model-specific config in examples/
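For step 1, here is a minimal sketch of writing a text-only record in OpenAI chat format to JSONL. Any schema beyond `messages`/`role`/`content` (e.g. image or audio fields) is processor-specific, so consult the Dataset Preparation guide for multimodal samples.

```python
# Hedged sketch: append one text-only OpenAI-chat-format sample to a JSONL file.
# Multimodal field names are not shown here - see the Dataset Preparation guide.
import json

sample = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```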
Comprehensive Guides
Getting Started:
- Dataset Preparation - How to prepare and structure your data
- Dataset & Packing Guide - Detailed dataset implementations and packing strategies
- Training Guide - Comprehensive training walkthrough
Advanced Topics:
- Design Principles - Architectural patterns and philosophy
- API Reference - Detailed API documentation
🏗️ Codebase Architecture
Component Registry
Factory Pattern enables easy extensibility:
# Register a custom dataset
from lmms_engine.datasets import register_dataset, BaseDataset

@register_dataset("my_custom_dataset")
class MyCustomDataset(BaseDataset):
    def __init__(self, config):
        super().__init__(config)
        # Custom initialization

    def __getitem__(self, idx):
        # Custom data loading
        return item

# Register a custom processor
from lmms_engine.datasets.processor import register_processor

@register_processor("my_custom_processor")
class MyCustomProcessor:
    def __call__(self, raw_data):
        # Custom processing
        return processed_data
Training Pipeline
Builder Pattern for flexible composition:
from lmms_engine.train import TrainRunner
# Configuration defines the pipeline
runner = TrainRunner(config)
runner.build() # Lazy initialization of components
runner.run() # Execute training
Pipeline stages:
- Model initialization - From pretrained or config
- Dataset creation - With processor and collator
- Monkey patching - Apply kernel optimizations
- Trainer setup - FSDP2, DeepSpeed, or custom
- Training execution - With checkpointing and logging
Supported Trainers
| Trainer Type | Use Case | Key Features |
|---|---|---|
| `hf_trainer` | General VLM/LM training | FSDP2, Muon, Liger, Flash Attn |
| `dllm_trainer` | Diffusion language models | Masked LM, custom loss, DLLM collator |
| `wan_trainer` | Video generation | Flow-matching, multi-modal inputs |
| `rae_trainer` | Visual autoencoders | Adversarial loss, EMA, LPIPS |
| `sit_trainer` | Diffusion transformers | Interpolant framework, CFG, EMA |
🎯 Use Cases
- Vision-Language Pretraining - Qwen-VL, LLaVA on large multimodal datasets
- Video Understanding - AERO on 3D video data
- Diffusion Models - DLLM, SiT, WanVideo for generation tasks
- Representation Learning - RAE for visual representations
- Language Model Pretraining - DGN, Qwen with Muon optimizer
- Multimodal Fine-tuning - Efficient SFT with sequence packing
🤝 Contributing
We welcome contributions! Please see our Design Principles for coding guidelines:
- Simplicity: Write simple, straightforward code
- Readability: Prioritize clarity over cleverness
- Testability: Create testable components
- Minimal Changes: Only modify code related to the task
- Less Code = Less Debt: Minimize code footprint
📝 Citation
If you use LMMs Engine in your research, please cite:
@software{lmms_engine2024,
title={LMMs Engine: A simple, unified multimodal framework for pretraining and finetuning.},
author={LMMs-Lab},
year={2024},
url={https://github.com/LMMs-Lab/lmms-engine}
}
📄 License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
🔗 Links
- GitHub: https://github.com/LMMs-Lab/lmms-engine
- LMMs-Lab: https://lmms-lab.com
- Documentation: docs/
- Issues: https://github.com/LMMs-Lab/lmms-engine/issues
Built with ❤️ by LMMs-Lab
⭐ Star us on GitHub to support the project! ⭐