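Anthropic research finds that AI agents perform well on coding tasks but often fail when retrieving data from biological databases. In one Ebola sequence task, Claude Sonnet 4 returned 106, 15, and 5 sequences across three runs, against an expected 266. The missing sequences badly skewed the scientific conclusion: the agent inferred that the outbreak traced back to 1922, while the manually curated result pointed to early 2014. The root causes are scattered biological databases, hidden website rules, and fragile scripts. Once a repeatable retrieval tool was introduced, agent accuracy and consistency improved substantially. Anthropic calls for building friendlier infrastructure.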
New Anthropic research shows AI agents may look brilliant at code, but in biology they can fail before the science starts.
Even strong AI agents gave very different answers to the exact same biology data request when nothing in the prompt had changed.
In one Ebola sequence task, Claude Sonnet 4 returned 106 sequences in one run, then 15, then 5, while the expected answer was 266.
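To put those counts in proportion, here is a minimal arithmetic sketch using only the figures quoted above. It assumes, generously, that every sequence an agent returned was one of the 266 expected ones, so the percentages are best-case ceilings rather than measured recall. The names `EXPECTED`, `RUNS`, and `recall_upper_bound` are illustrative and do not come from the research.

```python
# Figures quoted in the text: 266 expected sequences, three runs of one task.
EXPECTED = 266
RUNS = [106, 15, 5]


def recall_upper_bound(returned: int, expected: int) -> float:
    """Best-case recall, assuming every returned sequence is a correct one."""
    return min(returned, expected) / expected


recalls = [recall_upper_bound(n, EXPECTED) for n in RUNS]

# Ratio between the largest and smallest run, as a rough measure of spread.
spread = max(RUNS) / min(RUNS)

for n, r in zip(RUNS, recalls):
    print(f"{n:>3} of {EXPECTED} sequences -> at most {r:.1%} recall")
print(f"largest run is {spread:.1f}x the smallest")
```

Even under that generous assumption, the best run recovers well under half of the expected dataset, and the three runs differ by more than an order of magnitude.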
Those missing sequences did not just make the dataset messy; they changed the scientific story built on top of it.
One bad retrieval made the outbreak look as if it traced back to 1922, whereas the manually curated result pointed to early 2014.
The biology databases were too hard to use reliably through current AI tools.
The agents often understood what they were being asked, but their answers varied a lot because they had to fight through scattered databases, hidden website rules, and fragile scripts.
The key finding is that adding a repeatable retrieval tool made agents far more accurate and much more consistent.
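The text does not describe how the repeatable retrieval tool works, so the following is only a hypothetical sketch of one ingredient such a tool might have: the same logical query always yields the same de-duplicated, sorted result. `RepeatableRetriever` and the `fetch` callable are invented for illustration and are not Anthropic's implementation. The real database access would live inside `fetch`, which is deliberately left abstract here.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List

Query = Dict[str, str]


class RepeatableRetriever:
    """Hypothetical wrapper that makes repeated identical queries return identical results."""

    def __init__(self, fetch: Callable[[Query], Iterable[str]]):
        # `fetch` is an assumed backend that returns sequence identifiers.
        self._fetch = fetch
        self._cache: Dict[str, List[str]] = {}

    @staticmethod
    def _key(query: Query) -> str:
        # Canonical JSON (sorted keys) so field order does not change the key.
        canonical = json.dumps(query, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def retrieve(self, query: Query) -> List[str]:
        key = self._key(query)
        if key not in self._cache:
            # De-duplicate and sort so the output order is deterministic.
            self._cache[key] = sorted(set(self._fetch(query)))
        # Return a copy so callers cannot mutate the cached result.
        return list(self._cache[key])
```

A wrapper like this addresses consistency only. Whether the result is complete still depends on what the underlying `fetch` does, which is where handling of scattered databases and hidden website rules would have to go.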