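UniRL puts reinforcement learning for diffusion models and LLMs into a single training loop and adds two new algorithms; researchers working on multimodal alignment can fork the code and try it right away.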
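Tencent Hunyuan has released UniRL, a reinforcement learning infrastructure for unified multimodal models, together with two new algorithms, DRPO and Flow-DPPO. UniRL uses a single post-training loop (generate → score → advantage → update → sync) to cover diffusion/flow matching models, LLMs/VLMs, and unified multimodal models such as Hunyuan-Image 3 and Bagel. Models and algorithms are independent axes, so coverage is the model × algorithm product. The framework supports pluggable rollout engines (train-side / SGLang / vLLM-Omni), FSDP2 sharding, and three deployment modes. Flow-DPPO introduces trust-region policy optimization based on exact divergence for flow/diffusion models; DRPO provides a smooth, advantage-weighted quadratic regularization method for LLM RL. The code is open source.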
🚀 Introducing UniRL, an RL infra for unified multimodal models, together with two new RL algorithms: DRPO and Flow-DPPO.
One RL loop across diffusion/flow matching models, LLMs/VLMs, and unified multimodal models👇
Code: http://github.com/Tencent-Hunyuan/UniRL
(yes - U(you)-ni-(need) RL 😉)
1. Most RL stacks are built for one modality. UniRL applies a single post-training loop - generate → score → advantage → update → sync - across model families. Model and algorithm are two independent axes, so your coverage is the model × algorithm product, not a fixed recipe menu.
2. One loop, every modality: text→image, text/image→video, vision-language, text-only LLM and VLM, the LLM→diffusion prompt-enhancer, and unified autoregressive+diffusion generation (Hunyuan-Image 3 and Bagel) - a model class no single-purpose RL repo can even express.
3. Built to scale: pluggable rollout engines (train-side / SGLang / vLLM-Omni) behind one typed contract, FSDP2 sharding, and three deployment modes from a single config knob.
4. Two team-original algorithms headline the release:
Flow-DPPO: Policy optimization for flow/diffusion models with trust-region masks based on exact divergence (see our paper: Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models https://github.com/Tencent-Hunyuan/UniRL/blob/main/FlowDPPO/HY_FlowDPPO.pdf)
DRPO: LLM RL with a smooth, advantage-weighted quadratic regularizer (see our paper: Rethinking the Divergence Regularization in LLM RL https://arxiv.org/abs/2606.09821)
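
To make the "one loop, model × algorithm" idea from the thread concrete, here is a toy sketch in plain Python. It is not UniRL's API: `RolloutEngine`, `ToyEngine`, `train`, `run_grid`, and both update rules are hypothetical names invented for illustration, the "model" is a one-parameter Gaussian sampler, and the group-normalized advantage is one common choice assumed here, not something the post specifies. Neither update rule is DRPO or Flow-DPPO.

```python
import random
from dataclasses import dataclass, field
from itertools import product
from statistics import mean, pstdev
from typing import Callable, Protocol

Weights = dict[str, float]


class RolloutEngine(Protocol):
    """Hypothetical typed contract: sample from the policy, accept new weights."""

    def generate(self, n: int) -> list[float]: ...
    def sync(self, weights: Weights) -> None: ...


@dataclass
class ToyEngine:
    """Toy stand-in for a rollout backend: samples from N(mu, 1)."""

    seed: int = 0
    weights: Weights = field(default_factory=lambda: {"mu": 0.0})

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)

    def generate(self, n: int) -> list[float]:
        return [self.rng.gauss(self.weights["mu"], 1.0) for _ in range(n)]

    def sync(self, weights: Weights) -> None:
        self.weights = dict(weights)


def score(samples: list[float], target: float = 3.0) -> list[float]:
    """Toy reward: higher when a sample is closer to the target."""
    return [-(s - target) ** 2 for s in samples]


def advantages(rewards: list[float]) -> list[float]:
    """Group-normalized advantage (an assumption, not taken from the post)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


Algorithm = Callable[[Weights, list[float], list[float]], Weights]


def reinforce_update(weights: Weights, samples: list[float],
                     advs: list[float], lr: float = 0.05) -> Weights:
    """Score-function gradient step; for N(mu, 1), d/dmu log p(x) = x - mu."""
    grad = mean(a * (s - weights["mu"]) for s, a in zip(samples, advs))
    return {"mu": weights["mu"] + lr * grad}


def clipped_update(weights: Weights, samples: list[float],
                   advs: list[float]) -> Weights:
    """Same step, but with advantages clipped to [-1, 1] (toy variant)."""
    clipped = [max(-1.0, min(1.0, a)) for a in advs]
    return reinforce_update(weights, samples, clipped)


def train(engine: RolloutEngine, algorithm: Algorithm,
          steps: int = 300, group: int = 16) -> Weights:
    weights: Weights = {"mu": 0.0}
    engine.sync(weights)
    for _ in range(steps):
        samples = engine.generate(group)             # generate
        rewards = score(samples)                     # score
        advs = advantages(rewards)                   # advantage
        weights = algorithm(weights, samples, advs)  # update
        engine.sync(weights)                         # sync
    return weights


# Two independent axes: any model can be paired with any algorithm.
MODELS = {"toy_model_a": lambda: ToyEngine(seed=0),
          "toy_model_b": lambda: ToyEngine(seed=1)}
ALGORITHMS = {"reinforce": reinforce_update, "clipped": clipped_update}


def run_grid() -> dict[tuple[str, str], float]:
    """Train every (model, algorithm) pair through the same loop."""
    return {
        (m, a): train(make_engine(), algo)["mu"]
        for (m, make_engine), (a, algo) in product(MODELS.items(), ALGORITHMS.items())
    }


if __name__ == "__main__":
    for pair, mu in run_grid().items():
        print(pair, round(mu, 2))
```

The point of the sketch is structural: `train` never inspects which model or algorithm it was given, so adding one entry to either dictionary extends the grid by a full row or column instead of requiring a new recipe.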