
LMMs-Engine

A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.


Quick Start · Examples · Model Support · Optimizations · Architecture · Documentation


Overview

LMMs Engine is a highly efficient, modular framework for training unified multimodal models at scale.

Train a wide range of multimodal architectures, including language models (the Qwen series), vision-language models (Qwen2.5/3-VL, LLaVA-OV), diffusion models (dLLM, the WanVideo series), unified multimodal models (Qwen2.5-Omni, BAGEL), and specialized research architectures (RAE, Linear Attention, SiT).

Built with distributed-training optimizations (FSDP2 multi-dimensional parallelism, Ulysses sequence parallelism, Flash Attention, Liger Kernel, the Muon optimizer, Native Sparse Attention) and a modular design for easy extensibility.

Efficiency Report

[TBD] We will soon report MFU metrics for the following models.

  1. Qwen LLM series
  2. Qwen VLM series
  3. BAGEL
  4. dLLM

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/LMMs-Lab/lmms-engine.git
cd lmms-engine

# Install dependencies
uv sync

# Optional: Performance optimizations
uv pip install flash-attn --no-build-isolation
uv pip install liger-kernel

Launch Training

Recommended: torchrun (native PyTorch)

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
  --master_addr=127.0.0.1 --master_port=12355 \
  -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

Alternative: Accelerate

accelerate launch --use_fsdp \
  -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

Single GPU

python -m lmms_engine.launch.cli config_yaml=examples/qwen3_vl/example_config.yaml

🔥 Featured Examples

Model | Architecture | Highlights | Quick Start
BAGEL | Vision+Generation | Unified visual understanding & generation | run.sh
dLLM (Qwen3) | Diffusion LM | Masked diffusion language model | run.sh
Qwen2.5-Omni | Vision+Audio+Text | Unified multimodal (image, audio, text) | run.sh
Qwen2.5 | Text | Large language model | run.sh
Qwen2.5-VL | Vision+Text | Multimodal model | run.sh
Qwen3-VL | Vision-Language | Native-resolution, long context (10K+ tokens) | run.sh
RAE-SigLip | Visual AutoEncoder | Representation AutoEncoder, LPIPS, EMA | run.sh
SiT | Diffusion Transformer | Interpolant transformer, CFG, ImageNet-1K | run.sh
WanVideo | Video Generation | T2V/I2V/V2V generation (1.3B/14B) | run.sh
FLA models | Linear Attn Models | Efficient architecture, FineWeb-Edu pretraining | run.sh

Per-example support for FSDP2, Ulysses SP, Muon, Packing, and NSA varies; see each run.sh.

Optimization Legend:

  • FSDP2: Fully Sharded Data Parallel v2 for distributed training
  • Ulysses SP: Sequence Parallel for long contexts
  • Muon: Advanced optimizer with Newton-Schulz orthogonalization
  • Packing: First-fit bin packing, reaching 35-40% MFU vs. 20-25% without it (Qwen2.5-VL fine-tuning)
  • NSA: Native Sparse Attention for efficient long-context processing

💡 Tip: Each run.sh file contains detailed setup instructions, prerequisites, and configuration options.

🤖 Model Support

19+ architectures spanning vision-language, diffusion, and language models.

Multimodal Models

  • Qwen2.5-VL - Vision-language model with SOTA-level performance
  • Qwen3-VL - Vision-language model with SOTA-level performance
  • Qwen2.5-Omni - Unified vision + audio + text modalities
  • LLaVA-OneVision - Fully open-source vision-language model
  • BAGEL - Unified multimodal model for visual understanding and generation
  • Aero - Lightweight audio-language model

Diffusion & Generative Models

  • dLLM (Qwen3) - Diffusion Language Model with masked prediction
  • WanVideo (1.3B/14B) - Text/Image-to-Video generation (T2V/I2V/V2V)
  • SiT (XL/2) - Scalable Interpolant Transformers for class-conditional image generation
  • RAE-SigLip - Representation AutoEncoder with adversarial discriminator

Language Models

  • Qwen2/2.5/3 series - Full Liger kernel support with fused operations
  • Linear Attention Models - Recurrent architectures optimized for Muon; requires installing FLA first.
  • Custom architectures - Extensible via @register_model() decorator
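
For illustration, registering a custom architecture might look like the sketch below. This assumes @register_model() follows the same factory pattern as the @register_dataset() decorator shown under Codebase Architecture; the import path is an assumption, not the confirmed API.

# Hypothetical sketch: the import path and decorator semantics are assumed
# to mirror the @register_dataset() pattern shown later in this README.
import torch.nn as nn
from lmms_engine.models import register_model  # assumed import path

@register_model("my_custom_lm")
class MyCustomLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Build embeddings, transformer blocks, and output head here

    def forward(self, input_ids, attention_mask=None, labels=None):
        # Return whatever the chosen trainer expects (e.g. loss and logits)
        raise NotImplementedError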

⚡️ Optimizations

Production-grade efficiency from distributed training to kernel fusion.

Core Distributed Training

  • FSDP2 - PyTorch 2.0+ DTensor-based sharding for parameters, gradients, and optimizer states. Its improved composability over the original FSDP enables flexible combinations of parallelism strategies.

  • Ulysses Sequence Parallel - Splits sequence dimension across GPUs for ultra-long contexts. Critical for vision-language models like Qwen3-VL with 10K+ visual tokens.

  • Multi-dimensional Parallelism (TODO) - Compose TP × CP × PP × DP meshes for cluster-scale training.

Memory & Compute Optimizations

  • Flash Attention + Unpadding - Tiled attention with use_rmpad eliminates all padding computation, yielding a 2-3× speedup on variable-length sequences (see the sketch after this list).

  • Native Sparse Attention (NSA) - Hybrid attention mechanism combining compressed attention, top-k sparse attention, and sliding-window attention. Enables efficient long-context processing for the BAGEL model with a reduced memory footprint.

  • Liger Kernel - Triton fused kernels (CrossEntropy, RMSNorm, RoPE, SwiGLU) achieve 30% memory reduction by avoiding intermediate materializations.

  • Monkey Patching System - Runtime kernel injection via lmms_engine/configs/monkey_patch/ for model-specific optimizations without code modification.

  • Sequence Packing - First-fit bin packing achieves 35-40% MFU vs 20-25% without packing. Combined with unpadding for zero padding waste.
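
To make the unpadding idea concrete, here is a minimal, self-contained sketch in plain PyTorch (not the engine's internal use_rmpad code): it flattens a padded batch into one packed sequence and builds the cu_seqlens offsets that flash-attn's variable-length kernels consume.

import torch

# Minimal sketch of unpadding: flatten a padded batch into one packed
# sequence plus cumulative-length offsets (cu_seqlens). This is the input
# format flash-attn's varlen kernels expect; the engine's own use_rmpad
# path is more involved.
def unpad(hidden_states, attention_mask):
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)  # per-sample lengths
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    packed = hidden_states.reshape(-1, hidden_states.shape[-1])[indices]
    cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
    return packed, cu_seqlens, indices

mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
x = torch.randn(2, 4, 8)
packed, cu_seqlens, _ = unpad(x, mask)
print(packed.shape, cu_seqlens)  # torch.Size([5, 8]) tensor([0, 3, 5], dtype=torch.int32)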

Advanced Optimizer

  • Muon Optimizer - Newton-Schulz orthogonalization with Triton kernels, distributed via DTensor. Applied selectively to 2D parameters, it converges faster than AdamW.

Data Pipeline

  • Streaming Datasets - IterableDataset for trillion-token pretraining without full data loading.
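
The pattern, roughly (illustrative only; the engine's actual streaming datasets live under lmms_engine.datasets and handle sharding and resumption themselves): an IterableDataset yields records lazily instead of indexing a fully materialized corpus.

import json
from torch.utils.data import IterableDataset, get_worker_info

# Illustrative sketch: a shard-aware IterableDataset that streams JSONL
# records lazily, so the corpus never has to fit in memory.
class JsonlStream(IterableDataset):
    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        info = get_worker_info()
        # Give each dataloader worker a disjoint subset of shards
        paths = self.paths if info is None else self.paths[info.id::info.num_workers]
        for path in paths:
            with open(path) as f:
                for line in f:
                    yield json.loads(line)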

Configuration Examples

Sequence Packing - with full unpadding
dataset_config:
  packing: true
  packing_strategy: first_fit
  packing_length: 32000

trainer_args:
  use_rmpad: true  # Requires flash-attn
  use_liger_kernel: true
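
For intuition, first_fit places each sample into the first bin that still has room, with packing_length as the bin capacity. A standalone sketch (not the engine's implementation):

# Standalone sketch of first-fit bin packing over sample lengths.
# Each bin holds samples whose total length stays under `packing_length`;
# packed bins are then concatenated into single unpadded sequences.
def first_fit_pack(lengths, packing_length=32000):
    bins, loads = [], []
    for idx, length in enumerate(lengths):
        for b, load in enumerate(loads):
            if load + length <= packing_length:
                bins[b].append(idx)
                loads[b] += length
                break
        else:
            bins.append([idx])
            loads.append(length)
    return bins

print(first_fit_pack([9000, 25000, 6000, 15000], packing_length=32000))
# [[0, 2, 3], [1]] -> two packed sequences instead of four padded ones
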
Liger Kernel - Enable LinkedIn's Triton kernels for 30% memory reduction
trainer_args:
  use_liger_kernel: true

Fused operations:

  • CrossEntropy (major memory savings)
  • RMSNorm, RoPE, SwiGLU
  • Automatically applied via monkey patching
Muon Optimizer - State-of-the-art optimizer for LLMs
trainer_args:
  use_muon: true          # enable the Muon-with-Adam optimizer
  adam_beta1: 0.9         # for the Adam part of the optimizer
  adam_beta2: 0.999       # for the Adam part of the optimizer
  adam_epsilon: 1.0e-8    # for the Adam part of the optimizer
  learning_rate: 0.001
  weight_decay: 0.01
  # ns_steps: 5  # Newton-Schulz iterations (default)
  # To control which modules use Muon vs. Adam, see the note below.

Features:

  • Newton-Schulz orthogonalization with Triton kernels
  • Distributed via DTensor (FSDP2)
  • Selective 2D parameter application

Note If users wish to specify whether a module should be optimized using Muon or Adam, they can designate this in lmms_engine.train.hf.trainer.create_optimizer. By default, modules excluded from Muon optimization include those containing the following substrings in their names: ["emb", "norm", "lm_head", "bias", "wte", "wpe", "output", "a_proj", "b_proj", "conv1d", "rotary"] as well as any parameters whose dimension does not equal 2.
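
A sketch of that default split (the exclusion list and the ndim rule are taken from the note above; the authoritative logic is create_optimizer in lmms_engine.train.hf.trainer):

# Sketch of the default Muon/Adam split described above: parameters whose
# names contain an excluded substring, or whose ndim != 2, fall back to Adam.
EXCLUDED = ["emb", "norm", "lm_head", "bias", "wte", "wpe",
            "output", "a_proj", "b_proj", "conv1d", "rotary"]

def split_param_groups(model):
    muon_params, adam_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim != 2 or any(key in name for key in EXCLUDED):
            adam_params.append(param)
        else:
            muon_params.append(param)
    return muon_params, adam_params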

FSDP2 Configuration
trainer_args:
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen2VLDecoderLayer"]
    reshard_after_forward: false
    activation_checkpointing: true
Ulysses Sequence Parallel - For long-sequence VLMs
trainer_args:
  sp_ulysses_degree: 2  # Sequence parallel degree

Benefits:

  • Splits sequence length across GPUs
  • Reduces memory footprint for long contexts
  • Works with Flash Attention
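
In isolation, the sequence split itself is simple. The sketch below shows only the sharding step and omits the all-to-all exchange Ulysses performs inside attention; sp_rank and sp_degree would come from the sequence-parallel process group.

import torch

# Toy illustration of the Ulysses sequence split (the in-attention
# all-to-all exchange is omitted).
def shard_sequence(hidden_states, sp_rank, sp_degree):
    # hidden_states: (batch, seq_len, hidden); seq_len divisible by sp_degree
    return hidden_states.chunk(sp_degree, dim=1)[sp_rank]

x = torch.randn(1, 8192, 4096)
local = shard_sequence(x, sp_rank=0, sp_degree=2)
print(local.shape)  # torch.Size([1, 4096, 4096])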
Native Sparse Attention (NSA) - Efficient long-context attention for BAGEL
model_config:
  load_from_pretrained_path: "lmms-lab/BAGEL-7B-MoT-ver.LE"

monkey_patch:
  - type: nsa
    model_type: bagel
    kwargs:
      block_size: 64
      compress_type: "weightedpool"  # weightedpool, linear, avgpool
      kernel_size: 32
      kernel_stride: 16
      topk: 16
      init_blocks: 1
      local_blocks: 2
      window_size: 512

Features:

  • Compressed attention with key-value compression
  • TopK sparse attention for efficiency
  • Sliding window attention for local context
  • Hybrid mechanism combines all three attention types
  • Requires: pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git

Note: Currently only supported for BAGEL model.

📖 Documentation

Step-by-Step Workflow

  1. Process the dataset into OpenAI chat format (JSONL/JSON/Arrow/CSV); an example record is shown after this list

    hf download kcz358/open-thoughts-debug --local-dir data/open_thoughts_debug --repo-type dataset
    
  2. Prepare dataset YAML (optional for single data source)

    datasets:
      - path: data/open_thoughts_debug
        data_folder: ""
        data_type: arrow
    
  3. Configure training - See examples/qwen3_vl/example_config.yaml or any model-specific config in examples/
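
For reference, an OpenAI-chat-format record typically looks like the following (one JSONL line, shown as a Python dict; the multimodal content schema here is illustrative, and the exact fields the engine expects may differ):

# Illustrative OpenAI-style chat record. The image entry's exact schema
# is an assumption, not the engine's confirmed format.
record = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image_url": "images/0001.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]},
        {"role": "assistant", "content": "A golden retriever playing in the snow."},
    ]
}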


🏗️ Codebase Architecture

Component Registry

Factory Pattern enables easy extensibility:

# Register a custom dataset
from lmms_engine.datasets import register_dataset, BaseDataset

@register_dataset("my_custom_dataset")
class MyCustomDataset(BaseDataset):
    def __init__(self, config):
        super().__init__(config)
        # Custom initialization

    def __getitem__(self, idx):
        # Custom data loading
        return item

# Register a custom processor
from lmms_engine.datasets.processor import register_processor

@register_processor("my_custom_processor")
class MyCustomProcessor:
    def __call__(self, raw_data):
        # Custom processing
        return processed_data

Training Pipeline

Builder Pattern for flexible composition:

from lmms_engine.train import TrainRunner

# Configuration defines the pipeline
runner = TrainRunner(config)
runner.build()  # Lazy initialization of components
runner.run()    # Execute training

Pipeline stages:

  1. Model initialization - From pretrained or config
  2. Dataset creation - With processor and collator
  3. Monkey patching - Apply kernel optimizations
  4. Trainer setup - FSDP2, DeepSpeed, or custom
  5. Training execution - With checkpointing and logging

Supported Trainers

Trainer Type | Use Case | Key Features
hf_trainer | General VLM/LM training | FSDP2, Muon, Liger, Flash Attn
dllm_trainer | Diffusion language models | Masked LM, custom loss, DLLM collator
wan_trainer | Video generation | Flow-matching, multi-modal inputs
rae_trainer | Visual autoencoders | Adversarial loss, EMA, LPIPS
sit_trainer | Diffusion transformers | Interpolant framework, CFG, EMA

🎯 Use Cases

  • Vision-Language Pretraining - Qwen-VL, LLaVA on large multimodal datasets
  • Video Understanding - AERO on 3D video data
  • Diffusion Models - DLLM, SiT, WanVideo for generation tasks
  • Representation Learning - RAE for visual representations
  • Language Model Pretraining - DGN, Qwen with Muon optimizer
  • Multimodal Fine-tuning - Efficient SFT with sequence packing

🤝 Contributing

We welcome contributions! Please see our Design Principles for coding guidelines:

  • Simplicity: Write simple, straightforward code
  • Readability: Prioritize clarity over cleverness
  • Testability: Create testable components
  • Minimal Changes: Only modify code related to the task
  • Less Code = Less Debt: Minimize code footprint

📝 Citation

If you use LMMs Engine in your research, please cite:

@software{lmms_engine2024,
  title={LMMs Engine: A simple, unified multimodal framework for pretraining and finetuning.},
  author={LMMs-Lab},
  year={2024},
  url={https://github.com/LMMs-Lab/lmms-engine}
}

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Built with ❤️ by LMMs-Lab

Star us on GitHub to support the project!
