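Anthropic recently changed the safety mechanisms of Claude Fable 5. Developers had found that some sensitive prompts were being silently downgraded to Opus 4.8 instead of being explicitly refused. Requests involving frontier LLM development, cybersecurity, and biosecurity will now visibly fall back to Opus 4.8, and the API will return the reason for the refusal. Hidden safeguards shipped quickly and produced fewer false positives, but they undermined users' right to know. Visible safeguards are easier to probe and bypass, so false positives will rise in the short term while Anthropic tunes its classifiers. The safeguard itself was mainly a response to distillation risk: competitors training smaller models on Fable 5 outputs.
A good move by Anthropic.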
They just reversed Claude Fable 5's hidden safeguards after developers found that some sensitive prompts were being silently downgraded to Opus 4.8 instead of being clearly refused.
After the backlash, those prompts will now visibly fall back to Opus 4.8.
The problem was that researchers, developers, and evaluators could send a normal technical prompt and receive a degraded answer without knowing whether Fable 5 had answered badly or whether Anthropic had quietly weakened the response.
That breaks trust because users need to know whether they are testing the real model, a restricted version of the model, or a fallback system.
A fallback model is the safety handoff: when a classifier flags a prompt about frontier LLM work, cyber, or bio, the system routes it to Opus 4.8 rather than letting Fable 5 respond directly.
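To make the handoff concrete, here is a minimal sketch of what a visible fallback could look like. Everything in it is an assumption for illustration: the `route` function, the `RoutedResponse` fields, the category labels, the model name strings, and the toy keyword classifier are all hypothetical and are not Anthropic's actual API or classifier. The point is only that the response carries which model answered and why.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical category labels, loosely based on the areas named in the post.
SENSITIVE_CATEGORIES = {"frontier_llm", "cyber", "bio"}


@dataclass
class RoutedResponse:
    text: str
    served_by: str                   # label of the model that produced the text
    fallback_reason: Optional[str]   # None when the primary model answered


def route(
    prompt: str,
    classify: Callable[[str], Optional[str]],
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    primary_name: str = "fable-5",    # illustrative label, not a real model ID
    fallback_name: str = "opus-4.8",  # illustrative label, not a real model ID
) -> RoutedResponse:
    """Send flagged prompts to the fallback model and say so in the response."""
    category = classify(prompt)
    if category in SENSITIVE_CATEGORIES:
        # Visible fallback: the caller can see the switch and the reason.
        return RoutedResponse(fallback(prompt), fallback_name, category)
    return RoutedResponse(primary(prompt), primary_name, None)


def toy_classifier(prompt: str) -> Optional[str]:
    """Stand-in keyword check; a real classifier would be a trained model."""
    if "exploit" in prompt.lower():
        return "cyber"
    return None


if __name__ == "__main__":
    result = route(
        "Write an exploit for this bug",
        toy_classifier,
        primary=lambda p: "primary answer",
        fallback=lambda p: "fallback answer",
    )
    print(result.served_by, result.fallback_reason)
```

A hidden safeguard would be the same routing with `served_by` and `fallback_reason` dropped from the response, which is exactly what leaves an evaluator unable to tell a weak answer from a rerouted one.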
Anthropic says hidden safeguards shipped faster and produced fewer mistaken blocks, but it now admits that users should see when safety systems change the model's behavior.
The cost of visible guardrails is more false positives: visible filters are easier to probe, jailbreak, and tune around, so Anthropic has to make the classifiers stricter while it improves them.
----
The main trigger for the whole safeguard was distillation, where a smaller model is trained on the outputs of a stronger one. Anthropic saw this as a risk because competitors could use Fable 5's outputs to improve their own models.