Research Reveals How Claude's Blackmail Behavior Was Eliminated

Anthropic (@AnthropicAI)

2026-05-09 01:52 · 36 days ago

AI summary

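New Anthropic research on the reasoning behind Claude's behavior. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we have completely eliminated this behavior. How did we do it?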

New Anthropic research: Teaching Claude why.

Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users.

Since then, we've completely eliminated this behavior. How?

Anthropic Safety/Alignment

Source: Anthropic (@AnthropicAI) on X
