Injecting Distribution Awareness into Multimodal Large Language Models via Reinforcement Learning for Deep Imbalanced Regression
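To address the poor numerical regression performance of multimodal large language models under long-tailed target distributions, this work proposes a distribution-aware reinforcement learning framework. Using Group Relative Policy Optimization, the method introduces a reward based on the Concordance Correlation Coefficient that provides comparison-based supervision at the batch level, aligning the predicted distribution with the ground-truth distribution in correlation, scale, and mean. The plug-and-play framework requires no changes to the model architecture. On a unified long-tailed regression benchmark, the method consistently improves over supervised fine-tuning and existing regression methods, with especially large gains in the medium-shot and few-shot regimes.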
Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address this limitation, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization (GRPO). The framework introduces batch-level, comparison-based supervision through a reward based on the Concordance Correlation Coefficient (CCC), which aligns the predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play and requires no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
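To make the reward concrete, the following is a minimal sketch of the standard Concordance Correlation Coefficient computed over a batch of predictions, together with the usual GRPO-style normalization of rewards within a group. The function names `ccc` and `group_relative_advantages`, the `eps` stabilizer, and the way rewards are grouped are assumptions made for illustration; the abstract does not specify how the batch-level CCC is assigned to individual rollouts, so this should not be read as the authors' implementation.

```python
import numpy as np


def ccc(pred, target, eps=1e-8):
    """Concordance Correlation Coefficient between two 1-D batches.

    CCC = 2 * cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t))**2)
    The numerator rewards correlation; the denominator penalizes
    mismatches in scale (variances) and in location (means).
    `eps` is an assumed stabilizer for constant batches.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()  # population variances
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2 + eps)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group.

    `rewards` holds one scalar reward per sampled candidate in the
    group (hypothetical here: e.g. one CCC value per candidate batch
    of predictions).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


if __name__ == "__main__":
    target = [1.0, 2.0, 3.0]
    print(ccc([1.0, 2.0, 3.0], target))  # perfect agreement
    print(ccc([2.0, 2.0, 2.0], target))  # regression to the mean
    print(ccc([2.0, 3.0, 4.0], target))  # perfectly correlated but shifted
    print(group_relative_advantages([0.2, 0.5, 0.8]))
```

A constant prediction at the batch mean has zero covariance with the targets and therefore a CCC of zero, which is why such a reward discourages regression-to-the-mean behavior. The shifted prediction has a Pearson correlation of 1 but a CCC of only 4/7, illustrating that CCC also penalizes a mismatch in mean.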