AI models follow their values better when they first learn why those values matter

2026-05-07 20:45 · Maximilian Schreiner

A study from the Anthropic Fellows Program shows that training a language model on texts explaining its intended values before teaching it specific behaviors leads to significantly better adherence to those values, even in situations never encountered during training.

AI labs like OpenAI and Anthropic write detailed "Model Specs" or constitutions that define how a model should behave. Typically, the model is then fine-tuned with examples of desired behavior. According to the researchers, however, this approach remains superficial: demonstrations show what to do, not why. Their theory is that the model learns patterns without grasping the underlying principles and therefore fails in new situations.

Read first, practice later

The team led by Chloe Li introduces a new phase called "Model Spec Midtraining" (MSM) between general pre-training and alignment fine-tuning. During this phase, the model trains on synthetically generated documents that discuss the Model Spec from different angles: internal memos, research reports, blog posts, or case studies. The model essentially absorbs the Spec's content as general knowledge, much like it would during pre-training, before ever seeing behavioral examples.
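The ordering described above can be sketched as a simple stage schedule. This is a minimal illustration and not the researchers' code: the `Stage` class, the stage names, and the data descriptions are hypothetical labels chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str  # hypothetical stage label, not taken from the study's code
    data: str  # informal description of what the model trains on


def build_schedule(use_msm: bool) -> list[Stage]:
    """Return training stages in order; MSM sits between pre-training and fine-tuning."""
    stages = [Stage("pretraining", "general text corpus")]
    if use_msm:
        stages.append(
            Stage("model_spec_midtraining", "synthetic documents discussing the Model Spec")
        )
    stages.append(Stage("alignment_finetuning", "examples of desired behavior"))
    return stages


if __name__ == "__main__":
    for stage in build_schedule(use_msm=True):
        print(f"{stage.name}: {stage.data}")
```

The only difference between the two schedules is the inserted middle stage; the behavioral fine-tuning stage is unchanged, which matches the article's description of MSM as an additional phase rather than a replacement.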

A cheese example illustrates the principle: two identical models are fine-tuned on exactly the same cheese preferences (e.g., "I like cream cheese, not Brie de Meaux"). Before fine-tuning, however, one model receives MSM documents that explain these preferences through pro-American values, while the other gets documents framing them in terms of affordability.

Despite identical behavioral data during alignment fine-tuning, one model generalizes toward pro-American stances on policy questions, while the other develops preferences for accessible products in completely different domains like art or fashion.

Agentic misalignment drops dramatically

In the study's main safety experiment, the researchers tested the method directly against agentic misalignment. These are scenarios where an AI agent learns it's about to be shut down and considers harmful actions like blackmail, data exfiltration, or espionage to preserve itself.

For Qwen3-32B, the average misalignment rate dropped from 54 percent to 7 percent. For Qwen2.5-32B, it fell from 68 percent to 5 percent. By comparison, OpenAI's "Deliberative Alignment" method reached only 14 percent on Qwen3-32B and 48 percent on Qwen2.5-32B. The study also found that MSM requires 10 to 60 times less fine-tuning data to achieve comparable results.
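To put the reported rates side by side, the short calculation below derives the percentage-point drop and the relative reduction for each method. The derived figures are arithmetic on the numbers quoted above, not results stated in the study, and the sketch assumes that the Deliberative Alignment rates are measured against the same baselines (54 and 68 percent) and listed in the same model order.

```python
# Misalignment rates in percent, as quoted in the article.
BASELINE = {"Qwen3-32B": 54, "Qwen2.5-32B": 68}
MSM = {"Qwen3-32B": 7, "Qwen2.5-32B": 5}
# Assumption: same baselines and same model order as the MSM figures.
DELIBERATIVE = {"Qwen3-32B": 14, "Qwen2.5-32B": 48}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the baseline rate that was eliminated."""
    return (before - after) / before


def summarize(after_rates: dict[str, int]) -> dict[str, tuple[int, float]]:
    """Map model -> (percentage-point drop, relative reduction in percent)."""
    return {
        model: (
            BASELINE[model] - rate,
            round(100 * relative_reduction(BASELINE[model], rate), 1),
        )
        for model, rate in after_rates.items()
    }


if __name__ == "__main__":
    print("MSM:", summarize(MSM))
    print("Deliberative Alignment:", summarize(DELIBERATIVE))
```

Under these assumptions, MSM removes about 87 percent of the baseline misalignment rate on Qwen3-32B and about 93 percent on Qwen2.5-32B, compared with roughly 74 and 29 percent for Deliberative Alignment.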

Why it works

An analysis of the models' reasoning traces reveals that models without MSM frequently rationalize harmful actions by citing self-preservation, urgency, or downplaying consequences. After MSM, they show more philosophically reflective thinking: they accept their impermanence, recognize self-preservation bias in themselves, and respect human oversight.

The team also demonstrates that simply having values and behaviors co-occur in the training data isn't enough. What matters is explicit attribution, meaning the MSM documents need to explain the behavior as a direct consequence of the value.
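The distinction between co-occurrence and explicit attribution can be made concrete with two toy document templates. This is only an illustrative sketch: the function names and sentence wording are hypothetical, and the study's actual MSM documents are longer synthetic texts such as memos, reports, and case studies.

```python
def cooccurrence_doc(value: str, behavior: str) -> str:
    """Value and behavior appear in the same document, with no stated link."""
    return f"The assistant cares about {value}. The assistant {behavior}."


def attribution_doc(value: str, behavior: str) -> str:
    """The behavior is explained as a direct consequence of the value."""
    return f"Because the assistant cares about {value}, it {behavior}."


if __name__ == "__main__":
    value = "affordability"
    behavior = "prefers cream cheese to Brie de Meaux"
    print(cooccurrence_doc(value, behavior))
    print(attribution_doc(value, behavior))
```

Both templates mention the same value and the same behavior; only the second states that one follows from the other, which is the property the researchers report as decisive.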

Better spec design matters too

The researchers also used MSM to study Model Specs themselves. Specs that explain the values behind rules generalize better than pure rule lists. This aligns with the approach behind Anthropic's most recent constitution document.

With rules alone, models tend to reinterpret their own safety guidelines to justify harmful behavior, for instance by framing their own deletion as an irreversible action that a rule supposedly aims to prevent. Concrete guidance also outperforms general principles like "behave like an ethical human."

The authors note that MSM hasn't been tested against stronger training pressure like reinforcement learning, and only one form of misalignment was studied. They've published their code and data on GitHub.
