Tencent Hunyuan Releases UniRL: A Unified Multimodal Reinforcement Learning Infrastructure

Tencent Hy@TencentHunyuan


2026-06-09 19:45

Why it was featured

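UniRL packs reinforcement learning for diffusion models and LLMs into a single training loop and adds two new algorithms; researchers working on multimodal alignment can fork the code and try it right away.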

AI Summary

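Tencent Hunyuan has released UniRL, a reinforcement learning infrastructure for unified multimodal models, together with two new algorithms, DRPO and Flow-DPPO. UniRL uses a single post-training loop (generate → score → advantage → update → sync) to cover diffusion/flow-matching models, LLMs/VLMs, and unified multimodal models such as Hunyuan-Image 3 and Bagel. Models and algorithms are independent axes, so coverage is the model × algorithm product. The framework supports pluggable rollout engines (train-side / SGLang / vLLM-Omni), FSDP2 sharding, and three deployment modes. Flow-DPPO introduces trust-region policy optimization based on exact divergence for flow/diffusion models; DRPO provides a smooth, advantage-weighted quadratic regularization method for LLM RL. The code is open source.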

🚀 Introducing UniRL, an RL infra for unified multimodal models, together with two new RL algorithms: DRPO and Flow-DPPO.

One RL loop across diffusion/flow matching models, LLMs/VLMs, and unified multimodal models 👇

Code: http://github.com/Tencent-Hunyuan/UniRL

(yes - U(you)-ni-(need) RL 😉)

1. Most RL stacks are built for one modality. UniRL applies a single post-training loop - generate → score → advantage → update → sync - across model families. Model and algorithm are two independent axes, so your coverage is the model × algorithm product, not a fixed recipe menu.

2. One loop, every modality: text→image, text/image→video, vision-language, text-only LLM and VLM, the LLM→diffusion prompt-enhancer, and unified autoregressive+diffusion generation (Hunyuan-Image 3 and Bagel) - a model class no single-purpose RL repo can even express.

3. Built to scale: pluggable rollout engines (train-side / SGLang / vLLM-Omni) behind one typed contract, FSDP2 sharding, and three deployment modes from a single config knob.

4. Two team-original algorithms headline the release:

Flow-DPPO: Policy optimization for flow/diffusion models with trust-region masks based on exact divergence (see our paper: Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models, https://github.com/Tencent-Hunyuan/UniRL/blob/main/FlowDPPO/HY_FlowDPPO.pdf)

DRPO: LLM RL with a smooth, advantage-weighted quadratic regularizer (see our paper: Rethinking the Divergence Regularization in LLM RL, https://arxiv.org/abs/2606.09821)
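
To make the loop described in the thread concrete, here is a minimal Python sketch of a generate → score → advantage → update → sync step behind a typed rollout contract. Every name in it (RolloutEngine, EchoEngine, rl_step, toy_score, toy_update) is hypothetical and chosen for illustration; none of it is taken from the UniRL codebase. The group-normalized advantage is one common choice and is only an assumption here, and the toy update stands in for a real algorithm such as DRPO or Flow-DPPO, whose details are in the papers linked above.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


@dataclass
class Sample:
    prompt: str
    output: str


class RolloutEngine(Protocol):
    """Hypothetical typed contract: any backend that can generate and sync."""

    def generate(self, prompts: Sequence[str], n: int) -> List[List[Sample]]: ...

    def sync(self, weights: dict) -> None: ...


class EchoEngine:
    """Toy engine standing in for a train-side / SGLang / vLLM-Omni backend."""

    def __init__(self) -> None:
        self.weights: dict = {"step": 0}

    def generate(self, prompts: Sequence[str], n: int) -> List[List[Sample]]:
        # One group of n samples per prompt; output encodes the sample index.
        return [[Sample(p, f"{p}-{i}") for i in range(n)] for p in prompts]

    def sync(self, weights: dict) -> None:
        # Copy the trainer's weights into the rollout side.
        self.weights = dict(weights)


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalize rewards within one prompt's group (assumed, not UniRL's rule)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def toy_score(sample: Sample) -> float:
    # Placeholder reward: the sample index parsed from the toy output.
    return float(sample.output.rsplit("-", 1)[1])


def toy_update(weights: dict, groups, advantages) -> dict:
    # Placeholder for the algorithm axis; a real update would take a gradient step.
    return {**weights, "step": weights["step"] + 1}


def rl_step(
    engine: RolloutEngine,
    score: Callable[[Sample], float],
    update: Callable[[dict, list, list], dict],
    weights: dict,
    prompts: Sequence[str],
    n: int = 4,
):
    groups = engine.generate(prompts, n)                  # generate
    rewards = [[score(s) for s in g] for g in groups]     # score
    advs = [group_advantages(r) for r in rewards]         # advantage
    new_weights = update(weights, groups, advs)           # update
    engine.sync(new_weights)                              # sync
    return new_weights, advs


if __name__ == "__main__":
    engine = EchoEngine()
    weights, advs = rl_step(engine, toy_score, toy_update, {"step": 0}, ["a", "b"])
    print(weights, advs[0])
```

In this sketch the engine and the update function are passed in separately, which mirrors the idea of model and algorithm as independent axes: swapping EchoEngine for another backend or toy_update for another rule leaves rl_step unchanged.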


