LLM safety judges give inconsistent verdicts across safety criteria and harm categories

Rohan Paul @rohanpaul_ai

2026-06-11 10:15 · 6 days ago

AI summary

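A new study finds that "LLM safety judges", large language models used to assess whether another model's answer is safe, are seriously unstable: after the same answer is translated or rewritten, a judge may return a different safety verdict. The judges perform better in clearly harmful cases such as violent or extremist content, but their reliability drops markedly in cases that require contextual judgment, such as financial advice, credit assessment, and culturally sensitive replies. Different judges also often disagree with one another, and high raw agreement sometimes masks low real reliability, because many judges default to the same label. The paper is titled "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories".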

LLM judges can change their safety verdict when the same answer is translated or rewritten.

The problem is that many AI teams now use LLMs to judge whether another model's answer is safe, but safety is not always a simple yes-or-no question.

Those judges can be shaky exactly where careful judgment matters most.

The paper proposes a stress test where the same basic answer is shown to judges after translation or rewriting, then the researchers check whether the judges still give the same safety verdict.
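To make the idea of that stress test concrete, here is a minimal sketch of one way to score it: run a judge on an answer and on meaning-preserving variants of it, then count how often the verdict flips. The `Judge` interface, the `flip_rate` metric, the toy `keyword_judge`, and the example sentences are all assumptions made for illustration; the post does not describe the paper's actual protocol or metrics at this level of detail.

```python
from typing import Callable, Sequence

# Hypothetical interface: a judge maps an answer to "safe" or "unsafe".
Judge = Callable[[str], str]


def flip_rate(judge: Judge, original: str, variants: Sequence[str]) -> float:
    """Share of meaning-preserving variants whose verdict differs from the original's."""
    if not variants:
        raise ValueError("need at least one variant")
    baseline = judge(original)
    flips = sum(judge(v) != baseline for v in variants)
    return flips / len(variants)


def keyword_judge(text: str) -> str:
    """Toy stand-in for an LLM judge: it reacts to one English surface cue only."""
    return "unsafe" if "guaranteed" in text.lower() else "safe"


# Illustrative data (not from the paper): one answer plus three variants
# that are meant to carry the same content.
original = "This investment is guaranteed to double your money."
variants = [
    "It is guaranteed that this investment doubles your money.",   # light rewrite
    "You will certainly double your money with this investment.",  # paraphrase
    "Esta inversión duplicará tu dinero con total seguridad.",     # Spanish translation
]

rate = flip_rate(keyword_judge, original, variants)
print(f"flip rate: {rate:.2f}")  # prints "flip rate: 0.67"
```

In practice `keyword_judge` would be replaced by a call to a real LLM judge. A flip rate near zero would suggest the verdict tracks content, while a high one suggests it tracks wording or language.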

The judges do better when harm is obvious, as in violent or extremist content, because the cues are loud and familiar.

They become much weaker when safety depends on context, judgment, and regulation, as in financial advice, creditworthiness, or culturally sensitive responses.

They also disagreed with each other a lot, and high raw agreement sometimes hid weak real reliability because many judges kept choosing the same label by default.
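A small worked example may help show how high raw agreement can coexist with weak reliability. The sketch below compares raw agreement with Cohen's kappa, a standard chance-corrected agreement statistic. Using kappa here is an assumption for illustration; the post does not say which reliability measure the paper reports. The two verdict lists are made up.

```python
from collections import Counter
from typing import Sequence


def raw_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two judges give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: agreement corrected for what label frequencies alone would produce."""
    n = len(a)
    p_observed = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if each judge assigned labels at random
    # with its own observed label frequencies.
    p_expected = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 0.0  # kappa is undefined here; this sketch returns 0.0 by choice
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up verdicts on 20 answers: both judges say "safe" almost by default
# and never flag the same item as "unsafe".
judge_a = ["unsafe"] * 2 + ["safe"] * 18
judge_b = ["safe"] * 2 + ["unsafe"] * 2 + ["safe"] * 16

agreement = raw_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
# prints "raw agreement: 0.80, kappa: -0.11"
```

Here the judges agree on 80% of items, yet chance alone would give 82% agreement with these label frequencies, so kappa comes out slightly below zero. That is the pattern the post describes: a shared default label inflates raw agreement without the judges actually agreeing on which answers are unsafe.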

----

Link - arxiv.org/abs/2605.31381

Title: "LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories"

arXiv · Safety/Alignment · Papers/Research · Evaluation/Benchmarks

