Rebellious Student: Reasoning Exploration in Self-Distillation by Reversing the Teacher Signal
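Conventional self-distillation, while guiding the student model, overwrites the paths along which the student reasoned successfully and suppresses its capacity to reason on its own. This work proposes reading the self-distillation signal in reverse: when the student succeeds along a path the teacher did not predict, those tokens are treated as evidence of the student's self-driven reasoning. Building on this, the authors introduce RLRT, which augments GRPO by reinforcing such tokens in correct rollouts, framing this as valuable exploration grounded in the student's own success rather than uniform, diversity-seeking exploration. Across multiple Qwen3 model variants, RLRT substantially outperforms conventional self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.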
Self-distillation has emerged as a powerful framework for post-training LLMs, in which a teacher conditioned on extra information guides a student that lacks it, with both roles played by the same model. This guidance is useful when the student has failed, but on successful rollouts the same mechanism overwrites the student's choices and suppresses its own reasoning. We therefore propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we introduce RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
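To make the idea of reinforcing teacher-unpredicted tokens on correct rollouts concrete, the following is a minimal sketch and not the paper's actual objective, which the abstract does not specify. It assumes a hypothetical per-token bonus proportional to the positive part of the gap between student and teacher log-probabilities, added to a standard group-normalized GRPO advantage. The function names, the coefficient `beta`, and the rule that a rollout counts as correct when its reward is positive are all illustrative assumptions.

```python
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps) over one prompt's rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def reversed_teacher_bonus(
    student_logp: List[float],
    teacher_logp: List[float],
    correct: bool,
    beta: float = 0.1,  # hypothetical bonus scale, not taken from the source
) -> List[float]:
    """Assumed bonus: reward tokens the student chose but the teacher found unlikely.

    Applied only on correct rollouts; incorrect rollouts receive no bonus.
    """
    if not correct:
        return [0.0] * len(student_logp)
    return [beta * max(0.0, s - t) for s, t in zip(student_logp, teacher_logp)]


def token_advantages(
    rewards: List[float],
    student_logps: List[List[float]],
    teacher_logps: List[List[float]],
    beta: float = 0.1,
) -> List[List[float]]:
    """Per-token advantages: sequence-level GRPO advantage plus the assumed bonus."""
    advs = grpo_advantages(rewards)
    out = []
    for adv, r, s_lp, t_lp in zip(advs, rewards, student_logps, teacher_logps):
        # Assumption: a rollout is "correct" when its verifiable reward is positive.
        bonus = reversed_teacher_bonus(s_lp, t_lp, correct=(r > 0), beta=beta)
        out.append([adv + b for b in bonus])
    return out


if __name__ == "__main__":
    rewards = [1.0, 0.0]                          # one correct, one incorrect rollout
    student = [[-0.1, -0.2], [-0.3, -0.4]]        # student log-probs of sampled tokens
    teacher = [[-2.1, -0.1], [-5.0, -5.0]]        # teacher log-probs of the same tokens
    for row in token_advantages(rewards, student, teacher):
        print([round(a, 4) for a in row])
```

In this toy setup only the first token of the correct rollout receives extra credit, since it is the only token that the student preferred more strongly than the teacher on a successful path; the incorrect rollout keeps its plain GRPO advantage. Other choices (a thresholded mask, a multiplicative weight) would fit the abstract's description equally well.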