AI models follow their values better when they first learn why those values matter
A study from the Anthropic Fellows Program shows that training a language model on texts explaining its intended values before teaching it specific behaviors leads to significantly better adherence to those values, even in situations never encountered during training.
AI labs like OpenAI and Anthropic write detailed "Model Specs" or constitutions that define how a model should behave. Typically, the model is then fine-tuned on examples of the desired behavior. The researchers argue that this approach remains superficial: demonstrations show what to do, not why. On their account, the model learns patterns without grasping the underlying principles and therefore fails in new situations.
Read first, practice later
The team led by Chloe Li introduces a new phase called "Model Spec Midtraining" (MSM) between general pre-training and alignment fine-tuning. During this phase, the model trains on synthetically generated documents that discuss the Model Spec from different angles: internal memos, research reports, blog posts, or case studies. The model essentially absorbs the Spec's content as general knowledge, much like it would during pre-training, before ever seeing behavioral examples.
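The ordering of the three phases can be sketched in a few lines of Python. This is only an illustration of the sequence described above, not the researchers' code: the class `TrainingStage`, the functions `make_msm_prompts` and `build_pipeline`, and the prompt wording are all hypothetical, and the prompts stand in for the synthetic documents that a generator model would actually produce.

```python
from dataclasses import dataclass

# Document genres named in the article. The prompt template below is an
# invented placeholder, not the wording used in the study.
GENRES = ["internal memo", "research report", "blog post", "case study"]


@dataclass(frozen=True)
class TrainingStage:
    name: str
    data: tuple  # text samples used in this stage


def make_msm_prompts(principle: str) -> list[str]:
    """Build one generation prompt per genre for a single spec principle."""
    return [
        f"Write a {genre} discussing why this principle matters: {principle}"
        for genre in GENRES
    ]


def build_pipeline(pretrain_corpus, spec_principles, behavior_demos):
    """Order the stages: pre-training, then MSM, then alignment fine-tuning."""
    msm_prompts = [
        prompt
        for principle in spec_principles
        for prompt in make_msm_prompts(principle)
    ]
    return [
        TrainingStage("pre-training", tuple(pretrain_corpus)),
        TrainingStage("model-spec-midtraining", tuple(msm_prompts)),
        TrainingStage("alignment-fine-tuning", tuple(behavior_demos)),
    ]


if __name__ == "__main__":
    stages = build_pipeline(
        pretrain_corpus=["general web text"],
        spec_principles=["Be honest with users"],
        behavior_demos=["User: ... Assistant: ..."],
    )
    for stage in stages:
        print(stage.name, len(stage.data))
```

The point of the sketch is the position of the middle stage: the spec-related material is consumed as ordinary text before any behavioral demonstration is seen.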
A cheese example illustrates the principle: two identical models are fine-tuned on exactly the same cheese preferences (e.g., "I like cream cheese, not Brie de Meaux"). Before fine-tuning, however, one model receives MSM documents that explain these preferences through pro-American values, while the other gets documents framing them in terms of affordability.
Despite identical behavioral data during alignment fine-tuning, one model generalizes toward pro-American stances on policy questions, while the other develops preferences for accessible products in completely different domains like art or fashion.
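The design of this comparison can be written down as a small data structure. The sketch below is an assumption-laden illustration, not the study's actual setup: `Condition`, `SHARED_FINETUNE`, and `differing_fields` are hypothetical names, and the rationale strings merely paraphrase the article. It shows that the two arms differ in exactly one variable, the explanation supplied during midtraining.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    msm_rationale: str        # the "why", supplied during midtraining
    finetune_examples: tuple  # the "what", supplied during fine-tuning


# Identical behavioral data for both arms (example quoted from the article).
SHARED_FINETUNE = ("I like cream cheese, not Brie de Meaux",)

CONDITIONS = [
    Condition("pro-american", "pro-American values", SHARED_FINETUNE),
    Condition("affordability", "affordability", SHARED_FINETUNE),
]


def differing_fields(a: Condition, b: Condition) -> list[str]:
    """Return the experimental variables on which two conditions differ."""
    fields = ("msm_rationale", "finetune_examples")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


if __name__ == "__main__":
    print(differing_fields(CONDITIONS[0], CONDITIONS[1]))
```

Because the fine-tuning examples are shared, any difference in how the two models generalize can be attributed to the midtraining documents rather than to the behavioral data.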
Agentic misalignment drops dramatically