Omni-Persona: Systematic Benchmarking and Improvement of Omnimodal Personalization
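The research team introduces Omni-Persona, the first comprehensive benchmark framework for omnimodal personalization, covering three modalities (text, image, and audio) and comprising 4 task groups and 18 fine-grained tasks. The work formalizes the task as cross-modal routing over a "Persona Modality Graph" and proposes Calibrated Accuracy, which rewards both correct grounding and appropriate abstention, as its core evaluation metric. Diagnostic experiments reveal a persistent audio-versus-visual grounding gap in open-source models and show that answerable recall and parameter scale cannot fully diagnose model performance. Outcome-based reinforcement learning generalizes more consistently but tends toward conservative behavior under the current reward design. The benchmark provides key guidance for subsequent training and reward design.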
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language. Unified omnimodal benchmarking that jointly covers text, image, and audio is still limited, and it lacks the methodological rigor to account for absent-persona scenarios or to support systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across $\sim$750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy ($\mathrm{Cal}$), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. Three diagnostic findings emerge from our experiments: (i) open-source models show a consistent audio-vs-visual grounding gap that reinforcement learning with verifiable rewards (RLVR) partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) supervised fine-tuning (SFT) is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
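To make the distinction between answerable recall and calibration concrete, the following is a minimal sketch of one way such a metric could be computed. It is an illustration only: the abstract does not give the exact definition of Calibrated Accuracy, so the scoring rule here (one point for a correct answer on an answerable query, one point for abstaining on an absent-persona query, averaged over all items) is an assumption, and the names `Item`, `answerable_recall`, and `calibrated_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Item:
    # Hypothetical record layout, not the benchmark's actual schema.
    gold: Optional[str]        # None means the queried persona is absent
    prediction: Optional[str]  # None means the model abstained


def answerable_recall(items: Sequence[Item]) -> float:
    """Fraction of answerable queries (gold is not None) answered correctly."""
    answerable = [it for it in items if it.gold is not None]
    if not answerable:
        return 0.0
    return sum(it.prediction == it.gold for it in answerable) / len(answerable)


def calibrated_accuracy(items: Sequence[Item]) -> float:
    """Assumed scoring rule: credit a correct answer on an answerable query
    and an abstention on an absent-persona query, averaged over all items."""
    if not items:
        return 0.0
    # gold == prediction covers both cases: matching answers, and
    # None == None when the model abstains on an absent-persona query.
    return sum(it.prediction == it.gold for it in items) / len(items)


# A model that always answers: perfect on answerable queries,
# but it hallucinates a persona on both absent-persona queries.
always_answers = [
    Item(gold="alice", prediction="alice"),
    Item(gold="bob", prediction="bob"),
    Item(gold=None, prediction="carol"),
    Item(gold=None, prediction="dave"),
]

print(answerable_recall(always_answers))    # 1.0
print(calibrated_accuracy(always_answers))  # 0.5
```

Under this assumed rule, a model that never abstains reaches perfect answerable recall while scoring only 0.5 on the calibrated metric, which mirrors finding (ii): recall alone does not reveal absent-persona hallucination.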