UniRL — A Reinforcement Learning Framework for Unified Multimodal Models

U(you)·ni(need)·RL for unified multimodal intelligence

News 🚀

  • [2026-05] DRPO released: "Rethinking the Divergence Regularization in LLM RL" (arXiv).
  • [2026-06] Flow-DPPO released: "Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models" (paper).

About 💡

UniRL applies one RL post-training loop — generate samples, score them, compute advantages, update the policy, and sync weights back to rollout workers — across multimodal model families.

[Figure: UniRL architecture]

UniRL is a layered, composable system. Each entrypoint (train_diffusion, train_ar, train_pe, train_unified_model) loads a Hydra example config covering model, algorithm, rollout, reward, placement, and sync, then creates the matching domain trainer (DiffusionTrainer, ARTrainer, PETrainer, UnifiedModelTrainer). The trainer coordinates the RL loop across pluggable rollout engines, algorithms, model bundles, reward services, and the shared distributed runtime: Ray DevicePool, FSDP, Transfer Queue (TQ), and LoRA/full-weight sync. See unirl/README.md for the runtime loop, deployment modes, and module map.
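
The five steps of the loop (generate, score, compute advantages, update, sync) can be illustrated with a toy example. The sketch below is not UniRL's API: every class and function name is hypothetical, the "policy" is a three-way softmax, the reward is made up, and the update is plain REINFORCE with group-normalized advantages, chosen only because it fits in a few lines.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyRolloutWorker:
    """Hypothetical stand-in for a rollout engine; it keeps its own weight copy."""

    def __init__(self, logits, seed=0):
        self.logits = list(logits)
        self.rng = random.Random(seed)

    def generate(self, n):
        probs = softmax(self.logits)
        return self.rng.choices(range(len(probs)), weights=probs, k=n)

    def load_weights(self, logits):
        self.logits = list(logits)


def toy_reward(action):
    # Made-up reward for illustration: only action 2 counts as a good sample.
    return 1.0 if action == 2 else 0.0


def group_advantages(rewards, eps=1e-6):
    """Subtract the group mean and divide by the group std (zero if all rewards are equal)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def train(steps=200, group_size=16, lr=0.1, seed=0):
    policy_logits = [0.0, 0.0, 0.0]
    worker = ToyRolloutWorker(policy_logits, seed=seed)
    for _ in range(steps):
        actions = worker.generate(group_size)        # 1. generate samples
        rewards = [toy_reward(a) for a in actions]   # 2. score them
        advantages = group_advantages(rewards)       # 3. compute advantages
        probs = softmax(policy_logits)
        grad = [0.0] * len(policy_logits)
        for a, adv in zip(actions, advantages):      # 4. update the policy (REINFORCE)
            for k in range(len(grad)):
                indicator = 1.0 if k == a else 0.0
                grad[k] += adv * (indicator - probs[k]) / group_size
        policy_logits = [w + lr * g for w, g in zip(policy_logits, grad)]
        worker.load_weights(policy_logits)           # 5. sync weights to the rollout worker
    return softmax(policy_logits)


if __name__ == "__main__":
    print(train())
```

In this toy setting the probability of the rewarded action rises toward 1 as training proceeds. In a real deployment each numbered step would be handled by a separate component (rollout engine, reward service, algorithm, trainer, weight sync) rather than by one function.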

Team-Proposed Algorithms 🌟

🌟 These algorithms are proposed by our team — the highlight of UniRL. Each algorithm's folder holds a step-by-step tutorial and a runnable example recipe. We highly recommend trying them in our framework!

| Algorithm | Paper | Tutorial | Notes |
| --- | --- | --- | --- |
| Flow-DPPO | "Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models" | FlowDPPO/ | Diffusion/flow RL with an exact divergence-based trust-region mask. |
| DRPO | "Rethinking the Divergence Regularization in LLM RL" | DRPO/ | Token-level LLM RL with a smooth advantage-weighted quadratic regularizer. |

UniRL also wires in standard reference algorithms (GRPO for LLMs, DiffusionNFT, DanceGRPO, and MixGRPO) in unirl/algorithms/.

Model Support 🎨

Model support and algorithm support are independent dimensions that compose within a domain: any diffusion algorithm listed above runs on a diffusion model, and any AR algorithm runs on an AR model. UniRL therefore covers many more model × algorithm combinations than the shipped example recipes alone. The table below lists the model dimension; all listed models are supported (✅).

| Model | Category | Modality | Status |
| --- | --- | --- | --- |
| Stable Diffusion 3 / 3.5 | Image diffusion | Text → Image | ✅ |
| Qwen-Image | Image diffusion | Text → Image | ✅ |
| FLUX.2-Klein | Image diffusion | Text → Image | ✅ |
| WAN 2.1 | Video diffusion | Text / Image → Video | ✅ |
| WAN 2.2 | Video diffusion | Text / Image → Video | ✅ |
| HunyuanVideo 1.0 / 1.5 | Video diffusion | Text → Video | ✅ |
| Qwen-VL | Vision-language AR | Text + Image → Text | ✅ |
| Qwen3 | LLM AR | Text → Text | ✅ |
| Prompt-enhancer | LLM + diffusion | Text → Text → Image | ✅ |
| HunyuanImage3 | Unified AR + diffusion | Text → Image | ✅ |
| Bagel | Unified AR + diffusion | Text → Image | ✅ |

Each model maps to a domain entrypoint (train_diffusion, train_ar, train_pe, train_unified_model); see Getting Started below to run any of them.

Training Modes 🧩

UniRL unifies four training modes, each with its own Hydra example bucket and entrypoint. Examples are self-contained YAML files selected with --config-name=<domain>/<example>:

| Domain | Trains | Entrypoint | Example |
| --- | --- | --- | --- |
| diffusion/ | Image / video diffusion models | train_diffusion | diffusion/sd3_sglang_rollout_colocate |
| ar/ | Autoregressive models: vision-language (VLM) and text-only (LLM) | train_ar | ar/qwen_vl_grpo_geo3k_mc_4x8, ar/qwen3_drpo_4b_base_dpao_sglang |
| pe/ | Prompt-enhancer (AR rewriter + diffusion reward) | train_pe | pe/pe_sglang_full_pickscore |
| unified_model/ | Unified AR + diffusion models | train_unified_model | unified_model/hi3_vllmomni |

See examples/README.md for the full launch guide, naming schema, and how to add a recipe.
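
As a rough illustration of the <domain>/<example> naming, the sketch below maps a config name to its entrypoint using the table above and builds a Hydra compose-check command line. The helper name is hypothetical, and the `unirl.` module prefix for entrypoints other than train_diffusion is an assumption extrapolated from the Getting Started command; check examples/README.md for the authoritative launch instructions.

```python
import shlex

# Domain prefix -> entrypoint name, copied from the Training Modes table.
ENTRYPOINTS = {
    "diffusion": "train_diffusion",
    "ar": "train_ar",
    "pe": "train_pe",
    "unified_model": "train_unified_model",
}


def compose_check_command(config_name: str) -> list[str]:
    """Build a compose-check command for a '<domain>/<example>' config name.

    Hypothetical helper, not part of UniRL.
    """
    domain, sep, example = config_name.partition("/")
    if not sep or not example:
        raise ValueError(f"expected '<domain>/<example>', got {config_name!r}")
    if domain not in ENTRYPOINTS:
        raise ValueError(f"unknown domain {domain!r}; expected one of {sorted(ENTRYPOINTS)}")
    # Assumption: every entrypoint is importable as unirl.<entrypoint>,
    # as unirl.train_diffusion is in the Getting Started command.
    module = f"unirl.{ENTRYPOINTS[domain]}"
    # --cfg job --resolve asks Hydra to print the composed, resolved config
    # instead of running the job.
    return ["python", "-m", module, f"--config-name={config_name}", "--cfg", "job", "--resolve"]


if __name__ == "__main__":
    print(shlex.join(compose_check_command("diffusion/sd3_sglang_rollout_colocate")))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting; `shlex.join` is used only to display it.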

Getting Started ⚡

Install dependencies first — see INSTALL.md.

# compose-check, then launch a single-node example
python -m unirl.train_diffusion --config-name=diffusion/sd3_trainside --cfg job --resolve
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside

See the full launch guide for multi-node runs, every entrypoint, and Mooncake.

Roadmap 🗺️

We are actively expanding model and algorithm coverage. Near-term directions:

  • Broaden algorithm coverage for the newer model families — FLUX.2-Klein, HunyuanVideo 1.0 / 1.5, and Bagel.
  • Extend the team-proposed algorithms (Flow-DPPO, DRPO) to more model families.
  • Broaden reward backends and rollout-engine coverage across domains.

Want a model or algorithm prioritized? Open an issue to discuss.

Contributing 🤝

Contributions and questions are welcome. Before opening a pull request, read the repository conventions in AGENTS.md, run the pre-PR checks for the files you touched, and fill in the pull request template. For questions, bug reports, and feature requests, open an issue.

Acknowledgement 🙏

UniRL builds on ideas and infrastructure from the open-source RL and inference ecosystem. We especially thank vLLM, SGLang, slime, and verl.

Citation 📚

If you find UniRL helpful, please cite:

@misc{unirl_github,
  title        = {{UniRL: A Reinforcement Learning Framework for Unified Multimodal Models}},
  author       = {Haonan Wang and Linyu Wu and Qian Qiu and Lewei Jin and Bowen Ping and Jianghai Chen and Yiheng Du and Guangxin He and Yu Shi and Yongguang Lin and Zhuoxin Zhou and Zhanchao Zhou and Keming Wu and Rizhen Hu and Xuefei Ning and Lvfang Tao and Feiyu Hu and Xiangyan Liu and Siqi Kou and Jiarui Yao and Xiangxin Zhou and Liefeng Bo and Wenxi Zhu and Tianyu Pang},
  year         = {2026},
  howpublished = {\url{https://github.com/Tencent-Hunyuan/UniRL}},
  urldate      = {2026-06-05}
}

If you use DRPO, please also cite:

@misc{yao2026drpo,
  title         = {{Rethinking the Divergence Regularization in LLM RL}},
  author        = {Jiarui Yao and Xiangxin Zhou and Penghui Qi and Wee Sun Lee and Liefeng Bo and Tianyu Pang},
  year          = {2026},
  eprint        = {2606.09821},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2606.09821}
}

If you use Flow-DPPO, please also cite:

@misc{ping2026flowdppo,
  title        = {{Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models}},
  author       = {Bowen Ping and Xiangxin Zhou and Penghui Qi and Minnan Luo and Liefeng Bo and Tianyu Pang},
  year         = {2026},
  howpublished = {\url{https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO}},
  note         = {Manuscript dated June 8, 2026}
}
