G-Zero: An Open-Ended Generative Self-Play Framework Starting from Zero Data
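To address the capability bottlenecks and reward hacking that arise when large language models rely on external judges for open-domain tasks, the research team proposes G-Zero, a verifier-free co-evolutionary framework. Its core is the Hint-δ intrinsic reward mechanism, which quantifies how far the Generator model's predictions shift with versus without a self-generated hint, and uses that shift as the signal for self-improvement. Driven by this signal, the Proposer model continually produces challenging queries and hints that target the Generator's blind spots, while the Generator internalizes the hint-guided improvements. Theoretical analysis shows that, under idealized conditions, the framework admits a best-iterate suboptimality guarantee. G-Zero obtains its supervision entirely from internal dynamics, bypassing the capability ceiling of external judges and offering a scalable, robust path for continual model evolution in unverifiable domains.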
Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-δ, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtering keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
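
To make the three ingredients named in the abstract concrete, here is a minimal Python sketch under stated assumptions; it is not the authors' implementation. The abstract does not give the exact form of Hint-δ, so the sketch assumes it is the mean per-token log-probability shift of a response when the Generator is conditioned on the hint versus not. The function names (`hint_delta`, `grpo_advantages`, `dpo_loss`) and all numeric inputs are hypothetical. The GRPO step is shown only as the usual group-normalized advantage, and the DPO step as the standard per-pair loss.

```python
import math
from statistics import mean, pstdev


def hint_delta(logp_with_hint, logp_without_hint):
    """ASSUMED form of Hint-δ: mean per-token log-prob shift of one response
    when the Generator sees the hint versus when it does not."""
    if len(logp_with_hint) != len(logp_without_hint):
        raise ValueError("token log-prob lists must have equal length")
    return mean(a - b for a, b in zip(logp_with_hint, logp_without_hint))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (sequence-level log-probs)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Toy numbers only: per-token log-probs of the same response, with and without a hint.
    delta = hint_delta([-1.0, -2.0], [-2.0, -4.0])
    print("Hint-delta:", delta)

    # Proposer side: Hint-delta values of a sampled group act as rewards.
    print("advantages:", grpo_advantages([delta, 0.2, 0.9]))

    # Generator side: hint-guided response as "chosen", unassisted one as "rejected".
    print("DPO loss:", dpo_loss(-3.0, -6.0, -4.0, -5.0))
```

In this sketch the hint-guided response plays the role of the preferred sample in the DPO pair, which is one plausible reading of "internalize these hint-guided improvements"; the filtering of pseudo-labels mentioned in the abstract is not modeled here.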