New Anthropic research: Natural Language Autoencoders.
Models like Claude talk in words but think in numbers. These numbers, called activations, encode Claude's thoughts, but not in a language we can read.
Here, we train Claude to translate its activations into human-readable text.